A Fault Tolerant MPI-IO Implementation using the Expand Parallel File System

Calderón, Alejandro; García-Carballeira, Félix; Carretero, J.; Perez, J.M.; Sánchez, Luis Miguel

doi:10.1109/empdp.2005.3

Cited by 3 publications

(4 citation statements)

References 9 publications

“…By using different storage spaces for user data and redundant information, Expand can add and remove fault-tolerant support on a file basis [15].…”

Section: Fault Tolerant Model Types (mentioning)

confidence: 99%

“…In a previous paper in bibliography [15], we presented the implementation and evaluation of the first prototype, with different fault tolerant schemes and using MPI-IO interface. In this paper, we present the design and the initial framework behind the fault tolerant architecture of Expand.…”

Section: Introduction (mentioning)

confidence: 99%

“…In Expand, fault tolerance is provided at file level [14,15], because not all files have the same requirements of fault tolerance degree and performance. In fact, the number of critical files for a user during a period of time is a small part of all files that belong to him, and this subset of files changes along the time.…”

Section: Introduction (mentioning)

confidence: 99%

Fault tolerant file models for parallel file systems: introducing distribution patterns for every file

et al. 2008

Parallelism in file systems is obtained by using several independent server nodes, each supporting one or more secondary storage devices. This approach increases the performance and scalability of the system, but a fault in a single node can stop the whole system. To avoid this problem, data must be stored using some kind of redundancy technique, so that any data stored in a faulty element can be recovered. Fault tolerance can be provided in I/O systems by using replication or RAID-based schemes. However, most current systems apply the same technique to all files in the system. This paper describes the fault tolerance support provided by Expand, a parallel file system based on standard servers. This support can be applied to other parallel file systems with many benefits: fault tolerance at file level, flexible definition of the fault tolerance scheme to be used, the possibility of changing the fault tolerance support used for a file, etc.
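The abstract's idea of choosing a redundancy scheme per file can be illustrated with a minimal sketch. This is not Expand's implementation or API: the function names (`protect`, `recover`, `xor_blocks`) and the scheme labels are hypothetical, each block stands in for the part of a file held by one server, and the parity case uses plain XOR parity in the style of RAID 4/5, which tolerates the loss of one block.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def protect(blocks: List[bytes], scheme: str) -> List[bytes]:
    """Build the redundant data for ONE file under its own scheme.

    Hypothetical scheme labels: "none", "replication", "parity".
    The result is kept apart from the user data, so it can be added
    or dropped per file without touching the data blocks.
    """
    if scheme == "none":
        return []
    if scheme == "replication":
        return list(blocks)          # one full copy of every block
    if scheme == "parity":
        return [xor_blocks(blocks)]  # a single XOR parity block
    raise ValueError(f"unknown scheme: {scheme}")


def recover(blocks: List[bytes], redundancy: List[bytes],
            scheme: str, lost: int) -> bytes:
    """Rebuild the block at index `lost` from the surviving data."""
    if scheme == "replication":
        return redundancy[lost]
    if scheme == "parity":
        survivors = [b for i, b in enumerate(blocks) if i != lost]
        return xor_blocks(survivors + redundancy)
    raise ValueError("file has no redundancy; block cannot be rebuilt")


# Each file carries its own policy; only the critical one pays for redundancy.
critical = [b"abcd", b"efgh", b"ijkl"]
parity = protect(critical, "parity")
rebuilt = recover(critical, parity, "parity", lost=1)
```

Here `rebuilt` equals the lost block because XOR-ing the parity block with the surviving blocks cancels them out. Replication costs as much extra space as the file itself, while parity costs one extra block, which is the kind of trade-off a per-file policy lets the user make.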

“…In the Expand parallel file system [2], each file can have its own fault tolerance policy (no fault tolerance, replication, parity based redundancy, etc.). This can be seen as similar, however more powerful, to the difference we make between files involved in a checkpoint (which are replicated) and other files (which are not).…”

Section: Related Work (mentioning)

confidence: 99%

Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems

Riteau, Lèbre, Morin 2009

2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid

Computer clusters are today the reference architecture for high-performance computing. The large number of nodes in these systems induces a high failure rate. This makes fault tolerance mechanisms, e.g. process checkpoint/restart, a required technology to effectively exploit clusters. Most process checkpoint/restart implementations only handle volatile states and do not take into account the persistent states of applications, which can lead to incoherent application restarts. In this paper, we introduce an efficient persistent state checkpoint/restoration approach that can be interconnected with a large number of file systems. To avoid the performance issues of a stable support relying on synchronous replication mechanisms, we present a failure resilience scheme optimized for such persistent state checkpointing techniques in a distributed environment. First evaluations of our implementation in the kDFS distributed file system show the negligible performance impact of our proposal.
