18th International Parallel and Distributed Processing Symposium, 2004. Proceedings.
DOI: 10.1109/ipdps.2004.1303239

System-level fault-tolerance in large-scale parallel machines with buffered coscheduling

Abstract: Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by the University of California for the U.S. Department of Energy under contract W-7405-ENG-36. By acceptance of this article, the publisher recognizes that the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or to allow others to do so, for U.S. Government purposes. Los Alamos National Laboratory requests that the publisher identify this artic…

Cited by 17 publications
(16 citation statements)
References 6 publications
“…In a system where the failure of a single component can cause the entire application to fail, the MTBF of the system can be defined as (M) [16]:…”
Section: Effects Of Temperature Control On Reliability (mentioning)
confidence: 99%
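
The formula in the statement above is cut off in this listing, and the sketch below does not attempt to reproduce it. As a hedged illustration only, it computes the textbook series-system MTBF under the assumption that components fail independently with constant failure rates, so that the system failure rate is the sum of the component rates. The function name `system_mtbf` and the example numbers are hypothetical.

```python
def system_mtbf(component_mtbfs):
    """Series-system MTBF (same time unit as the inputs).

    Assumption (not taken from the quoted paper): components fail
    independently with constant failure rates, and any single
    component failure brings down the whole application.
    """
    if not component_mtbfs:
        raise ValueError("need at least one component MTBF")
    if any(m <= 0 for m in component_mtbfs):
        raise ValueError("component MTBFs must be positive")
    # Failure rates add for components in series.
    total_failure_rate = sum(1.0 / m for m in component_mtbfs)
    return 1.0 / total_failure_rate


if __name__ == "__main__":
    # Hypothetical machine: 10 identical nodes, each with a 1000-hour MTBF.
    print(system_mtbf([1000.0] * 10))
```

Under these assumptions, N identical components with MTBF m give a system MTBF of m / N, which is why reliability drops quickly as machines grow.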
“…Nevertheless, existing research has shown that checkpointing can cause severe performance degradation if used too frequently. Moreover, such a reactive approach suffers from non-trivial recovery cost and operational cost [19,24]. Hence, a new fault tolerant approach is needed to improve system resilience to failures in HPC.…”
Section: Introduction (mentioning)
confidence: 99%
“…in the order of minutes) [8]. Typical examples include the warnings produced by hardware sensors [1,12,16] regarding potential hardware problems or by software-based predictive methods using data mining and machine learning techniques [2,10,29]. Considerable research has been conducted on fault-aware scheduling [4,22,24,28,30]. This research mainly focuses on intelligent job allocation based on global failure distribution functions such as exponential, Weibull, or other long-term probabilities, rather than utilizing short-term fault prediction at runtime.…”
Section: Introduction (mentioning)
confidence: 99%
“…This requires a mechanism for determining what has changed and can entail considerable bookkeeping in the general case. A recent feasibility study obtained on a state-of-the-art cluster showed that efficient, scalable, automatic, and user-transparent incremental checkpointing is within reach with current technology [12]. Specifically, the study shows that current standard storage devices and high-performance networks provide sufficient bandwidth to allow frequent incremental checkpointing of a suite of scientific applications of interest with negligible degradation of application performance.…”
Section: Checkpoint/restart (mentioning)
confidence: 93%
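
The statement above notes that incremental checkpointing needs a mechanism for determining what has changed. A minimal sketch of one such mechanism follows, assuming the application state is available as a byte string and using per-block content hashes to detect changes. This is a swapped-in illustration, not the method of the cited study; `BLOCK_SIZE`, `block_digests`, and `incremental_checkpoint` are hypothetical names.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block granularity, in bytes


def block_digests(state: bytes, block_size: int = BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of `state`."""
    return [
        hashlib.sha256(state[i:i + block_size]).digest()
        for i in range(0, len(state), block_size)
    ]


def incremental_checkpoint(state: bytes, previous_digests, block_size: int = BLOCK_SIZE):
    """Find the blocks that changed since the previous checkpoint.

    Returns (changed_blocks, new_digests), where changed_blocks maps a
    block index to that block's bytes. Only these blocks would need to
    be written to stable storage.
    """
    new_digests = block_digests(state, block_size)
    changed_blocks = {}
    for index, digest in enumerate(new_digests):
        is_new = index >= len(previous_digests)
        if is_new or previous_digests[index] != digest:
            start = index * block_size
            changed_blocks[index] = state[start:start + block_size]
    return changed_blocks, new_digests


if __name__ == "__main__":
    memory = bytearray(4 * BLOCK_SIZE)
    full, digests = incremental_checkpoint(bytes(memory), [])
    memory[5000] = 1  # touch one byte in the second block
    delta, digests = incremental_checkpoint(bytes(memory), digests)
    print(len(full), sorted(delta))
```

Hashing every block costs time proportional to the full state size; it is used here only because it keeps the bookkeeping easy to follow.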
“…To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SC'05 November 12-18, 2005, Seattle, Washington, USA (c) 2005 ACM 1-59593-061-2/05/0011.…”
Section: Introduction (mentioning)
confidence: 99%