ACM/IEEE SC 2006 Conference (SC'06) 2006
DOI: 10.1109/sc.2006.5

A Software Based Approach for Providing Network Fault Tolerance in Clusters with uDAPL interface: MPI Level Design and Performance Evaluation

Cited by 12 publications (7 citation statements)
References 3 publications
“…There are several published studies on multi-method MPIs, including [4,11,12,17,30,36]. Most of these assume static configurations of available communication methods.…”
Section: Related Work (mentioning)
confidence: 99%
“…Most of these assume static configurations of available communication methods. Some of them support switching communication methods at runtime, but the main purpose is network fail-over [11,12,36]. MVAPICH2-ivc is designed for an environment where available communication methods may change due to migration.…”
Section: Related Work (mentioning)
confidence: 99%
“…In our previous work, we have designed MPI-2 one sided communication using multi-rail InfiniBand networks [14]. Handling network heterogeneity and network faults with asynchronous recovery of previously failed paths has also been presented [13]. However, the above works have focused on design and evaluation with multi-rail networks on the end nodes (multiple ports, multiple adapters), rather than the network.…”
Section: Related Work (mentioning)
confidence: 99%
“…A network or machine failure can be detected by checking the completion queue entries. In [1,2], a similar method was used to detect network failure.…”
Section: Active Detection of a Machine Crash or Network Failure (mentioning)
confidence: 99%
“…The design rationale for these studies has been that reliable remote memory connected with high speed interconnects are better than a single big machine in terms of cost-effectiveness. [Footnote 1: The widespread architecture adopted by vendors showing top ten TPC-C results is a clustered architecture, not a big mainframe.]…”
Section: Introduction (mentioning)
confidence: 99%