13th Euromicro Conference on Parallel, Distributed and Network-Based Processing
DOI: 10.1109/empdp.2005.3

A Fault Tolerant MPI-IO Implementation using the Expand Parallel File System

Abstract: Parallelism in file systems is obtained by using several independent server nodes supporting one or more secondary storage devices. This approach increases the performance and scalability of the system, but a fault in one single node can stop the whole system. To avoid this problem, data must be stored using some kind of redundant technique, so any data stored in a faulty element can be recovered. Fault tolerance can be provided in I/O systems using replication or RAID based schemes. However, most of the curre…
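The abstract's point that redundantly stored data "can be recovered" after a node fault can be illustrated with a generic single-parity (RAID-like) stripe. The following is only a minimal sketch of that general idea, not the Expand or MPI-IO implementation; the function names `xor_blocks`, `write_stripe`, and `recover` are hypothetical, and each list position stands in for one server node.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(data_blocks):
    """Return one block per server: the data blocks plus a single parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe, failed_index):
    """Rebuild the block of the failed server by XOR-ing all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# Three data servers plus one parity server (hypothetical layout).
stripe = write_stripe([b"abcd", b"efgh", b"ijkl"])

# Pretend server 1 has failed and rebuild its block from the others.
rebuilt = recover(stripe, failed_index=1)
```

A single parity block tolerates the loss of any one server in the stripe; replication, the other scheme the abstract mentions, would instead store full copies of each block on more than one server.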

Cited by 3 publications (4 citation statements)
References 9 publications
“…By using different storage spaces for user data and redundant information, Expand can add and remove fault-tolerant support on a file basis [15].…”
Section: Fault Tolerant Model Types (mentioning)
confidence: 99%
“…In a previous paper in bibliography [15], we presented the implementation and evaluation of the first prototype, with different fault tolerant schemes and using MPI-IO interface. In this paper, we present the design and the initial framework behind the fault tolerant architecture of Expand.…”
Section: Introduction (mentioning)
confidence: 99%
“…In the Expand parallel file system [2], each file can have its own fault tolerance policy (no fault tolerance, replication, parity based redundancy, etc.). This can be seen as similar, however more powerful, to the difference we make between files involved in a checkpoint (which are replicated) and other files (which are not).…”
Section: Related Work (mentioning)
confidence: 99%