Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery

León, Juán; Fisher, Allan L.; Steenkiste, Peter

doi:10.21236/ada266594

Cited by 56 publications

(29 citation statements)

References 18 publications

(7 reference statements)

“…These approaches typically launch daemons on every node that form and maintain communication groups that allow tracking and managing recovery by maintaining the configuration of the communication system. The failure of any given node in the group is handled by restarting the failed process on a different node, by restructuring the computation, or through transparent migration to another node [2] [13] [51].…”

Section: Operating System and Runtime-based Solutions (mentioning)

confidence: 99%

Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.1)

Hukerikar

Engelmann

2016

Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest very high fault rates in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, which may result in outcomes ranging from result corruptions to catastrophic application crashes.

Practical limits on power consumption in HPC systems will require future systems to embrace innovative architectures, increasing the levels of hardware and software complexities.

The resilience challenge for extreme-scale HPC systems requires management of various hardware and software technologies that are capable of handling a broad set of fault models at accelerated fault rates. These techniques must seek to improve resilience at reasonable overheads to power consumption and performance.

While the HPC community has developed various solutions, application-level as well as system-based solutions, the solution space of HPC resilience techniques remains fragmented. There are no formal methods and metrics to investigate and evaluate resilience holistically in HPC systems that consider impact scope, handling coverage, and performance & power efficiency across the system stack. Additionally, few of the current approaches are portable to newer architectures and software ecosystems, which are expected to be deployed on future systems.

In this document, we develop a structured approach to the management of HPC resilience based on the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems. The catalog of resilience design patterns provides designers with reusable design elements.

We define a design framework that enhances our understanding of the important constraints and opportunities for solutions deployed at various layers of the system stack. The framework may be used to establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components. The framework also enables optimization of the cost-benefit trade-offs among performance, resilience, and power consumption. The overall goal of this work is to enable a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner in spite of frequent faults, errors, and failures of various types.

“…This can be done by a consistent checkpointing scheme that saves a global checkpoint to a central file server at very coarse intervals (for example, once every hour or day). Such checkpointing schemes are straightforward and have been discussed and implemented elsewhere [11,17,18,26,29,36,38,43].…”

Section: A Model For Scientific Programs That Live On a NOW (mentioning)

confidence: 99%

“…In other words, rather than implement checkpointing transparently as in MIST [11], Fail-Safe PVM [29], or CoCheck [43], we hardwire it into the program. This is beneficial for several reasons.…”

Section: The Checkpointing Algorithm (mentioning)

confidence: 99%

Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing

Plank

Kim

Dongarra

1997

Journal of Parallel and Distributed Computing

Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load, or availability. As long as there are at least n processors in the cluster, and failures occur singly, the computation will complete in an efficient manner. We discuss the details of how the algorithms are tuned for fault tolerance and present the performance results on a PVM network of Sun workstations connected by a fast, switched Ethernet.
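The idea of using processor redundancy instead of stable storage can be illustrated with a minimal sketch. It assumes a single extra process holds the bytewise XOR parity of the other processes' in-memory checkpoints, which is one common form of diskless checkpointing and not necessarily the exact scheme the paper implements. The names `xor_parity` and `recover` and the sample state strings are hypothetical.

```python
def xor_parity(blocks):
    """Return the bytewise XOR of equally sized byte blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    parity = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover(surviving, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    # XOR is its own inverse, so XOR-ing the parity with every surviving
    # block cancels them out and leaves only the lost block.
    return xor_parity(list(surviving) + [parity])


# Hypothetical in-memory checkpoints of three application processes.
states = [b"proc0-st", b"proc1-st", b"proc2-st"]

# The assumed extra (parity) process stores this instead of writing to disk.
parity = xor_parity(states)

# Simulate the failure of process 1 and rebuild its checkpoint.
lost = 1
surviving = [s for i, s in enumerate(states) if i != lost]
recovered = recover(surviving, parity)
```

With one parity block, only one lost checkpoint can be rebuilt at a time, which matches the abstract's condition that failures occur singly.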

“…fault-tolerant networks and system reconfiguration after a fault. There has been some though, for example, FT-Linda [4], PLinda [15], Orca [16], Calypso [5], and Fail-safe PVM [17]. These systems use a combination of well known mechanisms such as replication, transactions, message logging, or checkpoints and rollbacks to provide fault-tolerance.…”

Section: Related Work (mentioning)

confidence: 99%

Fault tolerance via replication in coarse grain data-flow

Nguyen-Tuong

Grimshaw

Karpovich

1996

Parallel Symbolic Languages and Systems
