2005
DOI: 10.1007/11407522_13

Performance Implications of Failures in Large-Scale Cluster Scheduling

Abstract: As we continue to evolve into large-scale parallel systems, many of them employing hundreds of computing engines to take on mission-critical roles, it is crucial to design those systems anticipating and accommodating the occurrence of failures. Failures become a commonplace feature of such large-scale systems, and one cannot continue to treat them as an exception. Despite the current and increasing importance of failures in these systems, our understanding of the performance impact of these critical …

Citations: cited by 77 publications (71 citation statements)
References: references 41 publications (59 reference statements)
“…A bursty arrival breaks an important assumption made by numerous fault tolerant algorithms [8,20,15], that of independent and identical distribution of failures among the components of the system. However, few studies [19,4,11] investigate the bursty arrival of failures for distributed systems.…”
Section: Introduction (mentioning)
confidence: 99%
“…The importance of space-correlated failures has been repeatedly noted: the availability of a distributed system may be overestimated by an order of magnitude when as few as 10% of the failures are correlated [19], and a halving of the work loss may be achieved when taking into account space-correlated failures [20].…”
Section: Introduction (mentioning)
confidence: 99%
“…Studies of real systems, however, show that failures are correlated temporally and spatially, are not identically distributed. Furthermore, the behavior of checkpointing schemes under these realistic failure distributions does not follow the behavior predicted by standard checkpointing models [2,6].…”
Section: Related Work (mentioning)
confidence: 88%
“…The latter possess a varying failure and restore behavior, which is modelled to mimic reality as much as possible. As outlined by Zhang et al [6], failures in large-scale distributed systems are mostly correlated and tend to occur in bursts. Besides, there are strong spatial correlations between failures and nodes, where a small fraction of the nodes incur most of the failures.…”
Section: The System Model (mentioning)
confidence: 99%
“…Pausing, resuming, and migration VMs [22], [23] are powerful mechanisms to manage failures in such situations. The checkpointing and rollback recovery technique [24] has been widely used in distributed systems.…”
Section: Related Work (mentioning)
confidence: 99%