A large-scale study of failures in high-performance computing systems

Schroeder, Bianca; Gibson, Garth A.

doi:10.1109/dsn.2006.5

Cited by 507 publications

(425 citation statements)

References 20 publications

Supporting

Mentioning: 412

Contrasting

Unclassified


“…However, few studies [19,4,11] investigate the bursty arrival of failures for distributed systems. Even for these studies, the findings are based on data corresponding to a single system; until the recent creation of online repositories such as the Failure Trace Archive [13] and the Computer Failure Data Repository [18], failure data for distributed systems were largely inaccessible to the researchers in this area.…”

Section: Introduction (mentioning)

confidence: 99%

“…The deployment of these techniques and the design of new ones depend on understanding the characteristics of failures in real systems. While many failure models have been proposed for various computer systems [19,17,18,9], few consider the occurrence of failure bursts. In this work we present a new model that focuses on failure bursts, and validate it with real failure traces coming from a diverse set of distributed systems.…”

Section: Introduction (mentioning)

confidence: 99%


A Model for Space-Correlated Failures in Large-Scale Distributed Systems

Gallet, Yigitbasi, Javadi et al. 2010

Lecture Notes in Computer Science


Distributed systems such as grids, peer-to-peer systems, and even Internet DNS servers have grown significantly in size and complexity in the last decade. This rapid growth has allowed distributed systems to serve a large and increasing number of users, but has also made resource and system failures inevitable. Moreover, perhaps as a result of system complexity, in distributed systems a single failure can trigger within a short time span several more failures, forming a group of time-correlated failures. To eliminate or alleviate the significant effects of failures on performance and functionality, the techniques for dealing with failures require good failure models. However, not many such models are available, and the available models are valid for few or even a single distributed system. In contrast, in this work we propose a model that considers groups of time-correlated failures and is valid for many types of distributed systems. Our model includes three components: the group size, the group inter-arrival time, and the resource downtime caused by the group. To validate this model, we use failure traces corresponding to fifteen distributed systems. We find that space-correlated failures are dominant in terms of resource downtime in seven of the fifteen studied systems. For each of these seven systems, we provide a set of model parameters that can be used in research studies or for tuning distributed systems. Last, as a result of our work, six of the studied traces have been made available through the Failure Trace Archive.
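The three model components named in the abstract above (group size, group inter-arrival time, and downtime caused by the group) can be illustrated with a small sketch. This is not the authors' method: the abstract does not say how groups are formed, so the rule used here (a failure joins the current group if it starts within `window` seconds of the previous failure) is an assumption, as are the `Failure` record, the function names, and the choice to sum downtime per group.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    start: float     # time the resource went down, in seconds (hypothetical field)
    downtime: float  # how long the resource stayed down, in seconds (hypothetical field)


def group_failures(failures, window):
    """Group failures by start time.

    Assumed rule: a failure joins the current group when it starts at most
    `window` seconds after the previous failure in that group.
    """
    groups = []
    for failure in sorted(failures, key=lambda f: f.start):
        if groups and failure.start - groups[-1][-1].start <= window:
            groups[-1].append(failure)
        else:
            groups.append([failure])
    return groups


def summarize(groups):
    """Return the three per-group quantities: sizes, inter-arrival times, downtimes."""
    sizes = [len(g) for g in groups]
    starts = [g[0].start for g in groups]
    inter_arrivals = [later - earlier for earlier, later in zip(starts, starts[1:])]
    downtimes = [sum(f.downtime for f in g) for g in groups]  # assumed aggregation: sum
    return sizes, inter_arrivals, downtimes


# Made-up example trace, for illustration only.
trace = [Failure(0, 10), Failure(5, 20), Failure(100, 5), Failure(103, 5), Failure(500, 1)]
sizes, inter_arrivals, downtimes = summarize(group_failures(trace, window=10))
```

With a real trace, the resulting lists would be the samples to which distributions are fitted; the window length is a free parameter in this sketch.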


“…However, the shape and scale parameters are different for each study. Nurmi et al. [18] and Schroeder et al. [22] report that the shape parameter is less than 1, which means that the hazard rates (the frequency with which a system or component fails) decrease with time. Whereas, Iosup et al. [10] report that the shape parameter is greater than 1, which indicates an increasing hazard rate over time.…”

Section: A Resource Reliability Model (mentioning)

confidence: 99%

“…Recent studies [20], [10], [22], [18] show that the mean time between failures (MTBF) on modern high performance clusters is best modeled by a Weibull distribution [25]. However, the shape and scale parameters are different for each study.…”

Section: A Resource Reliability Model (mentioning)

confidence: 99%

“…Whereas, Iosup et al. [10] report that the shape parameter is greater than 1, which indicates an increasing hazard rate over time. Hence, we wanted to explore both regions for the shape parameter in our study and created two sets of reliability configurations: one set with shape parameter ranging between 0.5 and 0.9 according to [22] and the other set with shape parameter ranging between 10 and 13 according to [10]. Each set comprises three reliability characteristics based on the quality of the resource (from a reliability standpoint): stable, normal and shaky.…”

Section: A Resource Reliability Model (mentioning)

confidence: 99%
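The statements above turn on how the Weibull shape parameter governs the hazard rate. A minimal sketch of that relationship uses the standard Weibull hazard function h(t) = (k/λ)(t/λ)^(k-1), with shape k and scale λ. The shape values 0.7 and 12 are picked from inside the two ranges quoted above; the scale of 100 and the function name are illustrative assumptions, not values taken from any of the cited studies.

```python
import math


def weibull_hazard(t, shape, scale):
    """Standard Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1) for t > 0."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


SCALE = 100.0  # illustrative scale parameter (assumption)

# Shape < 1: the hazard rate falls as time passes.
early_low, late_low = weibull_hazard(10, 0.7, SCALE), weibull_hazard(100, 0.7, SCALE)

# Shape > 1: the hazard rate rises as time passes.
early_high, late_high = weibull_hazard(10, 12, SCALE), weibull_hazard(100, 12, SCALE)

# Shape == 1: the hazard rate is constant (the exponential case), equal to 1 / scale.
constant = weibull_hazard(5, 1, SCALE)
```

Comparing `early_low` with `late_low`, and `early_high` with `late_high`, shows the two regimes that the citing paper explores with its two sets of reliability configurations.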


Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids

Yang, Mandal, Koelbel et al. 2009

2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid


More and more complex scientific workflows are now executed on computational grids. In addition to the challenges of managing and scheduling these workflows, additional reliability challenges arise because of the unreliable nature of large-scale grid infrastructure. Fault tolerance mechanisms like over-provisioning and checkpoint-recovery are used in current grid application management systems to address these reliability challenges. In this work, we propose new approaches that combine these fault tolerance techniques with existing workflow scheduling algorithms. We present a study on the effectiveness of the combined approaches by analyzing their impact on the reliability of workflow execution, workflow performance and resource usage under different reliability models, failure prediction accuracies and workflow application types.


Active optimistic and distributed message logging for message‐passing applications

Ropars, Morin 2011

Concurrency and Computation


Message logging is an attractive solution to provide fault tolerance for message-passing applications because it is more scalable than coordinated checkpointing. Sender-based message logging is a well-known optimization that allows the saving of message payload in the sender memory. Thus, only message reception events have to be logged reliably by using an event logger. This paper proposes solutions to further improve message logging protocol scalability. In existing works on message logging, the event logger has always been considered as a centralized process. We propose a distributed event logger that takes advantage of multi-core processors that are to be executed in parallel with application processes, leveraging the volatile memory of the nodes to save events reliably. We also propose the combination of our distributed event logger and O2P, an active optimistic message logging protocol using a gossip-based protocol to disseminate information on new stable events. Our distributed event logger and O2P are implemented in the Open MPI library. Our results show the following: (i) distributed event logging improves message logging protocol scalability and (ii) using O2P with a distributed event logger provides an efficient and scalable fault-tolerant solution for message-passing applications.
