Performance Implications of Failures in Large-Scale Cluster Scheduling

Zhang, Yanyong; Squillante, Mark S.; Sivasubramaniam, Anand; Sahoo, Ramendra K.

doi:10.1007/11407522_13

Cited by 77 publications

(71 citation statements)

References 41 publications

(59 reference statements)

“…A bursty arrival breaks an important assumption made by numerous fault tolerant algorithms [8,20,15], that of independent and identical distribution of failures among the components of the system. However, few studies [19,4,11] investigate the bursty arrival of failures for distributed systems.…”

Section: Introduction (mentioning)

confidence: 99%

A Model for Space-Correlated Failures in Large-Scale Distributed Systems

Gallet, Yigitbasi, Javadi et al. 2010

Lecture Notes in Computer Science

Distributed systems such as grids, peer-to-peer systems, and even Internet DNS servers have grown significantly in size and complexity in the last decade. This rapid growth has allowed distributed systems to serve a large and increasing number of users, but has also made resource and system failures inevitable. Moreover, perhaps as a result of system complexity, in distributed systems a single failure can trigger within a short time span several more failures, forming a group of time-correlated failures. To eliminate or alleviate the significant effects of failures on performance and functionality, the techniques for dealing with failures require good failure models. However, not many such models are available, and the available models are valid for few or even a single distributed system. In contrast, in this work we propose a model that considers groups of time-correlated failures and is valid for many types of distributed systems. Our model includes three components, the group size, the group inter-arrival time, and the resource downtime caused by the group. To validate this model, we use failure traces corresponding to fifteen distributed systems. We find that space-correlated failures are dominant in terms of resource downtime in seven of the fifteen studied systems. For each of these seven systems, we provide a set of model parameters that can be used in research studies or for tuning distributed systems. Last, as a result of our work six of the studied traces have been made available through the Failure Trace Archive
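
The three model components named in the abstract above (group size, group inter-arrival time, and resource downtime caused by the group) can be illustrated with a minimal sketch. It assumes a failure trace given as `(start_time, downtime)` pairs and a fixed time window that decides whether consecutive failures belong to the same group. The names `FailureGroup`, `group_failures`, `inter_arrival_times`, and the `window` parameter are hypothetical and are not taken from the paper, whose actual grouping procedure may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FailureGroup:
    start: float     # start time of the first failure in the group
    size: int        # number of failures in the group
    downtime: float  # summed downtime of all failures in the group


def group_failures(events: List[Tuple[float, float]], window: float) -> List[FailureGroup]:
    """Group failures whose start times lie within `window` of the previous failure.

    `events` holds (start_time, downtime) pairs. The fixed-window rule is an
    illustrative assumption, not the paper's method.
    """
    groups: List[FailureGroup] = []
    last_start = None
    for start, downtime in sorted(events):
        if groups and start - last_start <= window:
            # Close enough to the previous failure: extend the current group.
            groups[-1].size += 1
            groups[-1].downtime += downtime
        else:
            # Gap exceeds the window (or first event): open a new group.
            groups.append(FailureGroup(start, 1, downtime))
        last_start = start
    return groups


def inter_arrival_times(groups: List[FailureGroup]) -> List[float]:
    """Time between the starts of consecutive groups."""
    return [b.start - a.start for a, b in zip(groups, groups[1:])]
```

Given such groups, the empirical distributions of `size`, `downtime`, and the inter-arrival times could then be fitted separately, which is one way to read the three-component structure the abstract describes.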

“…The importance of space-correlated failures has been repeatedly noted: the availability of a distributed system may be overestimated by an order of magnitude when as few as 10% of the failures are correlated [19], and a halving of the work loss may be achieved when taking into account space-correlated failures [20].…”

Section: Introduction (mentioning)

confidence: 99%

A Model for Space-Correlated Failures in Large-Scale Distributed Systems

Gallet, Yigitbasi, Javadi et al. 2010

Lecture Notes in Computer Science

“…Studies of real systems, however, show that failures are correlated temporally and spatially, and are not identically distributed. Furthermore, the behavior of checkpointing schemes under these realistic failure distributions does not follow the behavior predicted by standard checkpointing models [2,6].…”

Section: Related Work (mentioning)

confidence: 88%

“…The latter possess a varying failure and restore behavior, which is modelled to mimic reality as much as possible. As outlined by Zhang et al. [6], failures in large-scale distributed systems are mostly correlated and tend to occur in bursts. Besides, there are strong spatial correlations between failures and nodes, where a small fraction of the nodes incur most of the failures.…”

Section: The System Model (mentioning)

confidence: 99%

Providing Fault-Tolerance in Unreliable Grid Systems Through Adaptive Checkpointing and Replication

Chtepen, Claeys, Dhoedt et al. 2007

Computational Science – ICCS 2007

Abstract. As grids typically consist of autonomously managed subsystems with strongly varying resources, fault-tolerance forms an important aspect of the scheduling process of applications. Two well-known techniques for providing fault-tolerance in grids are periodic task checkpointing and replication. Both techniques mitigate the amount of work lost due to changing system availability but can introduce significant runtime overhead. The latter largely depends on the length of checkpointing interval and the chosen number of replicas, respectively. This paper presents a dynamic scheduling algorithm that switches between periodic checkpointing and replication to exploit the advantages of both techniques and to reduce the overhead. Furthermore, several novel heuristics are discussed that perform on-line adaptive tuning of the checkpointing period based on historical information on resource behavior. Simulation-based comparison of the proposed combined algorithm versus traditional strategies based on checkpointing and replication only suggests significant reduction of average task makespan for systems with varying load.

“…Pausing, resuming, and migrating VMs [22], [23] are powerful mechanisms to manage failures in such situations. The checkpointing and rollback recovery technique [24] has been widely used in distributed systems.…”

Section: Related Work (mentioning)

confidence: 99%

IMCLA: Performance Evaluation of Integrated Multilevel Checkpointing Algorithms using Checkpointing Efficiency

Singh, Chhabra, Singh 2013

Int. J. Com. Dig. Sys.

The main objective of this research is to improve checkpoint efficiency for integrated multilevel checkpointing algorithms (IMLCA) and to prevent checkpointing from becoming the bottleneck of cloud data centers. To find an efficient checkpoint interval, checkpointing overheads are also considered in this paper. Traditional checkpointing methods persistently store snapshots of the current job state and use them to resume execution at a later time. This research focuses on strategies for deciding when and whether a checkpoint should be taken, and evaluates them with regard to minimizing the induced monetary costs. Performance comparisons made by varying the rerun time of checkpoints are used to evaluate the optimal checkpoint interval. The proposed fail-over strategy works at the application layer and provides high availability for the Platform as a Service (PaaS) feature of cloud computing.
