ABARIS: An Adaptable Fault Detection/Recovery Component Framework for MPIs

Jitsumoto, Hideyuki; Endo, Toshio; Matsuoka, Satoshi

doi:10.1109/ipdps.2007.370603

Cited by 12 publications

(11 citation statements)

References 14 publications

“…This paper focuses on the former techniques. It builds on recently developed techniques such as checkpointing (with restarts or rollbacks) or redundant computing in high-performance computing [3,6,13,21,34] or API extensions for checkpointing [15,24]. A common challenge of transparent resiliency lies in the detection of faults, which is also a requirement for fault-awareness as proposed in MPI-3 [1].…”

Section: Introduction (mentioning)

confidence: 99%

Assessing HPC Failure Detectors for MPI Jobs

Kharbas, Kim, Hoefler, et al. 2012

2012 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing

“…MPI implementations must improve their ability to handle failures, such as broken connections and dead processes, to the extent possible. A number of research efforts in fault-tolerant MPI implementation exist [2,7,13]. However, production MPI implementations need to improve their support for fault tolerance.…”

Section: Fault Tolerance (mentioning)

confidence: 99%

Open Issues in MPI Implementation

Thakur, Gropp

Advances in Computer Systems Architecture

Abstract. MPI (the Message Passing Interface) continues to be the dominant programming model for parallel machines of all sizes, from small Linux clusters to the largest parallel supercomputers such as IBM Blue Gene/L and Cray XT3. Although the MPI standard was released more than 10 years ago and a number of implementations of MPI are available from both vendors and research groups, MPI implementations still need improvement in many areas. In this paper, we discuss several such areas, including performance, scalability, fault tolerance, support for debugging and verification, topology awareness, collective communication, derived datatypes, and parallel I/O. We also present results from experiments with several MPI implementations (MPICH2, Open MPI, Sun, IBM) on a number of platforms (Linux clusters, Sun and IBM SMPs) that demonstrate the need for performance improvement in one-sided communication and support for multithreaded programs.

“…This system could initiate different recovery actions designed to remedy different fault types. Jitsumoto and colleagues [44] also developed a detector that differentiated between hardware, process, and transmission faults. Here, users were allowed to pre-select a recovery procedure to be invoked in response to occurrence of a particular fault type.…”

Section: Detecting Faults and Failures in Grid Resources (mentioning)

confidence: 99%
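
The statement above describes a detector that distinguishes hardware, process, and transmission faults and lets users pre-select the recovery procedure for each. A minimal sketch of that idea follows, assuming a simple lookup from fault type to handler. All names here (`FaultType`, `RecoveryRegistry`, the handler functions) are hypothetical illustrations and are not taken from ABARIS or any MPI implementation; the handlers only return a description of the action instead of performing it.

```python
from enum import Enum, auto
from typing import Callable, Dict


class FaultType(Enum):
    """The three fault classes named in the citation statement."""
    HARDWARE = auto()
    PROCESS = auto()
    TRANSMISSION = auto()


# A recovery procedure takes the rank of the affected process and
# returns a description of the action (hypothetical signature).
RecoveryFn = Callable[[int], str]


def migrate_to_spare_node(rank: int) -> str:
    return f"rank {rank}: migrate to spare node"


def restart_from_checkpoint(rank: int) -> str:
    return f"rank {rank}: restart from checkpoint"


def resend_message(rank: int) -> str:
    return f"rank {rank}: resend message"


class RecoveryRegistry:
    """Maps each fault type to the procedure the user pre-selected."""

    def __init__(self) -> None:
        self._handlers: Dict[FaultType, RecoveryFn] = {}

    def select(self, fault: FaultType, handler: RecoveryFn) -> None:
        # Pre-select the recovery procedure for one fault type.
        self._handlers[fault] = handler

    def recover(self, fault: FaultType, rank: int) -> str:
        # Invoke the pre-selected procedure once a fault is classified.
        handler = self._handlers.get(fault)
        if handler is None:
            raise LookupError(f"no recovery procedure selected for {fault.name}")
        return handler(rank)


registry = RecoveryRegistry()
registry.select(FaultType.HARDWARE, migrate_to_spare_node)
registry.select(FaultType.PROCESS, restart_from_checkpoint)
registry.select(FaultType.TRANSMISSION, resend_message)

print(registry.recover(FaultType.PROCESS, 3))
```

Keeping detection (classifying the fault) separate from recovery (the registered handler) is what allows a different procedure to be chosen per fault type without changing the detector.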

“…If the host processor has not failed, temporal redundancy can be used to roll back and restart the process on the same platform. As in other systems, this method is widely used in grids [16,44,47]. However, if the host has failed, the process may be migrated, or transferred, to a different execution environment.…”

Section: Checkpoint and Recovery (mentioning)

confidence: 99%

“…Here, the results also indicated that blocking reduced efficiency, while a non-blocking approach appeared to scale well but suffered from implementation issues. Coordinated checkpointing has been implemented in MPI, in LAM/MPI [65] as well as in [44,62,66,67]. Yeom et al [66,67] proposed a fault-tolerant version of MPI, MPICH-GF, for grid systems that employed coordinated checkpoints with blocking.…”

Section: Checkpoint and Recovery (mentioning)

confidence: 99%

Reliability of Grid Computing Systems

Computing System Reliability

cdabrowski@nist.gov

Summary: In recent years, grid technology has emerged as an important tool for solving compute-intensive problems within the scientific community and in industry. To further the development and adoption of this technology, researchers and practitioners from different disciplines have collaborated to produce standard specifications for implementing large-scale, interoperable grid systems. The focus of this activity has been the Open Grid Forum, but other standards development organizations have also produced specifications that are used in grid systems. To date, these specifications have provided the basis for a growing number of operational grid systems used in scientific and industrial applications. However, if the growth of grid technology is to continue, it will be important that grid systems also provide high reliability. In particular, it will be critical to ensure that grid systems are reliable as they continue to grow in scale, exhibit greater dynamism, and become more heterogeneous in composition. Ensuring grid system reliability in turn requires that the specifications used to build these systems fully support reliable grid services. This study surveys work on grid reliability that has been done in recent years and reviews progress made toward achieving these goals. The survey identifies important issues and problems that researchers are working to overcome in order to develop reliability methods for large-scale, heterogeneous, dynamic environments. The survey also illuminates reliability issues relating to standard specifications used in grid systems, identifying existing specifications that may need to be evolved and areas where new specifications are needed to better support reliability.
