SC14: International Conference for High Performance Computing, Networking, Storage and Analysis 2014
DOI: 10.1109/sc.2014.54

Fail-in-Place Network Design: Interaction Between Topology, Routing Algorithm and Failures

Abstract: The growing system size of high performance computers results in a steady decrease of the mean time between failures. Exchanging network components often requires whole system downtime, which increases the cost of failures. In this work, we study a fail-in-place strategy where broken network elements remain untouched. We show that a fail-in-place strategy is feasible for today's networks and the degradation is manageable, and provide guidelines for the design. Our network failure simulation toolchain a…

Cited by 23 publications (16 citation statements)
References: 30 publications
“…Fig.2 shows the architecture of the proposed combination of OFS and OMNeT++. In general, we have followed similar ideas to those proposed in [10], [14], but we introduce several improvements. The main contributions of this methodology are the following:…”
Section: Methodology Description (mentioning)
confidence: 99%
“…Next, we use the IB tools included in OFS to obtain the information needed to build the network in the simulation tool. For the time being, we have put efforts in the development of a software layer which integrates OFS and the network simulators proposed in [8], [9] and [10]. The rest of this paper is organized as follows: Section II shows an overview of the InfiniBand architecture.…”
Section: Motivation (mentioning)
confidence: 99%
“…Then, it

Component        AFR     MTTF (hours)  Reliability
Network [12,17]  1.00%   876,000       4-nines
NIC [12,17]      1.00%   876,000       4-nines
DRAM [18]        39.5%   22,177        2-nines
CPU [18]         41.9%   20,906        2-nines
Server [17,39]   47.9%   18,304        2-nines

Table 2: Worst case scenario reliability data. The reliability is estimated over a period of 24 hours and expressed in the "nines" notation; the MTTF is expressed in hours.…”
Section: Dare: Safety and Liveness (mentioning)
confidence: 99%
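The relationship between the AFR, MTTF, and "nines" columns quoted above can be illustrated with a short sketch. It is not taken from the citing paper: it assumes that MTTF is the number of hours per year (8,760) divided by the AFR, and that failures follow an exponential model, so the reliability over a period t is exp(-t / MTTF). The function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours (assumed, no leap years)


def mttf_hours(afr):
    """MTTF in hours for an annualized failure rate given as a fraction.

    Assumption: MTTF = hours per year / AFR.
    """
    return HOURS_PER_YEAR / afr


def failure_probability(mttf, period_hours=24.0):
    """Probability of at least one failure within the period.

    Assumption: exponentially distributed failures, so
    reliability = exp(-period / MTTF) and this returns 1 - reliability.
    expm1 keeps precision when the probability is very small.
    """
    return -math.expm1(-period_hours / mttf)


def nines(mttf, period_hours=24.0):
    """Number of leading nines in the reliability over the period."""
    return math.floor(-math.log10(failure_probability(mttf, period_hours)))


if __name__ == "__main__":
    # AFR values copied from the quoted table.
    for name, afr in [("Network", 0.01), ("DRAM", 0.395),
                      ("CPU", 0.419), ("Server", 0.479)]:
        m = mttf_hours(afr)
        print(f"{name}: MTTF ~ {m:,.0f} h, {nines(m)}-nines over 24 h")
```

Under these assumptions, an AFR of 1.00% gives an MTTF of 876,000 hours and 4-nines reliability over 24 hours, and the three higher-AFR rows all give 2-nines, in line with the quoted table. The computed MTTF values for the other rows can differ slightly from the table, presumably because the published AFRs are rounded.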
“…Various sources provide failure data of systems and system components [12,18,31,36]. Yet, systems range from very reliable ones with AFRs per component below 0.2% [31] to relatively unreliable ones with component failure log events at an annual rate of more than 40% [18] (here we assume that a logged error impacted the function of the device).…”
Section: Fine-grained Failure Model (mentioning)
confidence: 99%