International Conference on Dependable Systems and Networks (DSN'06)
DOI: 10.1109/dsn.2006.5

A large-scale study of failures in high-performance computing systems

Abstract: Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations is publicly available. This paper analyzes failure data recently made publicly available by one of the largest high-performance computing sites. The data has been collected over the past 9 years at Los Alamos National Laboratory and includes 23000 failures recorded on more than 20 different systems, mostly large clusters of SMP and NUMA nodes. We stud…

Cited by 507 publications (425 citation statements)
References: 20 publications
“…However, few studies [19,4,11] investigate the bursty arrival of failures for distributed systems. Even for these studies, the findings are based on data corresponding to a single system. Until the recent creation of online repositories such as the Failure Trace Archive [13] and the Computer Failure Data Repository [18], failure data for distributed systems were largely inaccessible to researchers in this area.…”
Section: Introduction (mentioning)
confidence: 99%
“…The deployment of these techniques and the design of new ones depend on understanding the characteristics of failures in real systems. While many failure models have been proposed for various computer systems [19,17,18,9], few consider the occurrence of failure bursts. In this work we present a new model that focuses on failure bursts, and validate it with real failure traces coming from a diverse set of distributed systems.…”
Section: Introduction (mentioning)
confidence: 99%
“…However, the shape and scale parameters are different for each study. Nurmi et al. [18] and Schroeder et al. [22] report that the shape parameter is less than 1, which means that the hazard rate (the rate at which a system or component fails) decreases with time, whereas Iosup et al. [10] report that the shape parameter is greater than 1, which indicates an increasing hazard rate over time.…”
Section: A Resource Reliability Model (mentioning)
confidence: 99%
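
The link between the Weibull shape parameter and the direction of the hazard rate, as described in the statement above, can be illustrated with a short sketch. It uses the standard two-parameter Weibull hazard function h(t) = (k/λ)(t/λ)^(k-1), where k is the shape and λ is the scale. The function names `weibull_hazard` and `hazard_trend` and the parameter values in the usage lines are illustrative assumptions, not values taken from any of the cited studies.

```python
def weibull_hazard(t, shape, scale):
    """Hazard rate of a two-parameter Weibull distribution at time t > 0.

    h(t) = (shape / scale) * (t / scale) ** (shape - 1)
    """
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def hazard_trend(shape):
    """Classify how the hazard rate changes over time for a given shape."""
    if shape <= 0:
        raise ValueError("shape must be positive")
    if shape < 1:
        return "decreasing"
    if shape > 1:
        return "increasing"
    return "constant"  # shape == 1 reduces to the exponential distribution


# Hypothetical parameters, chosen only to show the three regimes.
for k in (0.7, 1.0, 1.5):
    early = weibull_hazard(10.0, shape=k, scale=100.0)
    late = weibull_hazard(100.0, shape=k, scale=100.0)
    print(f"shape={k}: h(10)={early:.5f}, h(100)={late:.5f}, {hazard_trend(k)}")
```

With a shape below 1 the hazard at t = 100 is lower than at t = 10, and with a shape above 1 it is higher, which matches the two conflicting findings the statement reports.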
“…Recent studies [20], [10], [22], [18] show that the mean time between failures (MTBF) on modern high performance clusters is best modeled by a Weibull distribution [25]. However, the shape and scale parameters are different for each study.…”
Section: A Resource Reliability Model (mentioning)
confidence: 99%