System-level fault-tolerance in large-scale parallel machines with buffered coscheduling

Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving reliability of HPC machines. Fault rates are known to double for every 10 °C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving reliability of parallel machines and reducing total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12%. In addition, our scheme can also reduce machine energy consumption by as much as 25%. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.
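The doubling rule quoted in the abstract above can be written as a simple exponential scaling. The following is a minimal sketch of that rule only, not the paper's validated model; the function names and the assumption that MTBF is the reciprocal of the fault rate are illustrative choices, not taken from the source.

```python
def relative_fault_rate(temp_c: float, ref_temp_c: float,
                        doubling_step_c: float = 10.0) -> float:
    """Fault rate at temp_c relative to the rate at ref_temp_c.

    Encodes only the rule of thumb from the abstract: the fault rate
    doubles for every 10 degrees C rise in core temperature.
    """
    return 2.0 ** ((temp_c - ref_temp_c) / doubling_step_c)


def relative_mtbf(temp_c: float, ref_temp_c: float,
                  doubling_step_c: float = 10.0) -> float:
    """MTBF at temp_c relative to ref_temp_c.

    Assumption (hypothetical simplification): MTBF is the reciprocal
    of the fault rate, so it halves whenever the fault rate doubles.
    """
    return 1.0 / relative_fault_rate(temp_c, ref_temp_c, doubling_step_c)


if __name__ == "__main__":
    # A core running 10 degrees C hotter than the reference.
    print(relative_fault_rate(60.0, 50.0))
    # A core held 10 degrees C cooler than the reference.
    print(relative_mtbf(40.0, 50.0))
```

Under this rule alone, holding a core 10 °C below the reference temperature halves its fault rate, which is the lever the abstract describes for improving reliability through temperature restraint.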

“…In a system where the failure of a single component can cause the entire application to fail, the MTBF of the system can be defined as (M ) [16]:…”

Section: Effects Of Temperature Control On Reliability

confidence: 99%

A 'cool' way of improving the reliability of HPC machines

Sarood

Meneses

Kalé

2013

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

“…Nevertheless, existing research has shown that checkpointing can cause severe performance degradation if used too frequently. Moreover, such a reactive approach suffers from non-trivial recovery cost and operational cost [19,24]. Hence, a new fault tolerant approach is needed to improve system resilience to failures in HPC.…”

Section: Introduction

confidence: 99%

“…in the order of minutes) [8]. Typical examples include the warnings produced by hardware sensors [1,12,16] regarding potential hardware problems or by software-based predictive methods using data mining and machine learning techniques [2,10,29]. Considerable research has been conducted on fault-aware scheduling [4,22,24,28,30]. This research mainly focuses on intelligent job allocation based on global failure distribution functions such as exponential, Weibull, or other long-term probabilities, rather than utilizing short-term fault prediction at runtime.…”

Section: Introduction

confidence: 99%

Fault-Driven Re-Scheduling For Improving System-level Fault Resilience

Gujrati

Lan

et al. 2007

2007 International Conference on Parallel Processing (ICPP 2007)

The productivity of HPC systems is determined not only by their performance, but also by their reliability. The conventional method to limit the impact of failures is checkpointing. However, existing research shows that such a reactive fault tolerance approach can only improve system productivity marginally. Leveraging the recent progress made in the field of failure prediction, we propose fault-driven rescheduling (FARS) to improve system resilience to failures, and investigate the feasibility and effectiveness of utilizing failure prediction to dynamically adjust the placement of active jobs (e.g., running jobs). In particular, a rescheduling algorithm is designed to enable effective job adjustment by evaluating the performance impact of potential failures and of rescheduling on user jobs. The proposed FARS complements existing research on fault-aware scheduling by allowing user jobs to avoid imminent failures at runtime. We evaluate FARS by using actual workloads and failure events collected from production HPC systems. Our preliminary results show the potential of FARS in improving system resilience to failures.

“…This requires a mechanism for determining what has changed and can entail considerable bookkeeping in the general case. A recent feasibility study obtained on a state-of-the-art cluster showed that efficient, scalable, automatic, and user-transparent incremental checkpointing is within reach with current technology [12]. Specifically, the study shows that current standard storage devices and high-performance networks provide sufficient bandwidth to allow frequent incremental checkpointing of a suite of scientific applications of interest with negligible degradation of application performance.…”

Section: Checkpoint/restart

confidence: 93%

“…To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SC'05, November 12-18, 2005, Seattle, Washington, USA (c) 2005 ACM 1-59593-061-2/05/0011.…”

Section: Introduction

confidence: 99%

Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers

Gioiosa

Sancho

Jiang

et al.

ACM/IEEE SC 2005 Conference (SC'05)

Self Cite

117

We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread, specifically designed to provide fault tolerance in Linux clusters. This implementation, based on the 2.6.11 Linux kernel, provides the essential functionality for transparent, highly responsive, and efficient fault tolerance based on full or incremental checkpointing at system level. TICK is completely user-transparent and does not require any changes to user code or system libraries; it is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5 µs; and it supports incremental and full checkpoints with minimal overhead: less than 6% with full checkpointing to disk performed as frequently as once per minute.
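The difference between full and incremental checkpoints mentioned above can be illustrated at user level. The sketch below is not TICK's kernel-level mechanism, which the abstract does not detail; it assumes a fixed-size state held in a byte string and detects changed blocks by comparing hashes. The 4096-byte block size and all function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block granularity, chosen for illustration


def block_digests(state: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Hash every block of the state so later versions can be compared."""
    return [hashlib.sha256(state[i:i + block_size]).digest()
            for i in range(0, len(state), block_size)]


def incremental_checkpoint(state: bytes, prev_digests: list,
                           block_size: int = BLOCK_SIZE):
    """Return (delta, digests) for the current state.

    delta maps block index -> block bytes for every block whose hash
    differs from the previous checkpoint. With an empty prev_digests
    every block is saved, which amounts to a full checkpoint.
    """
    digests = block_digests(state, block_size)
    delta = {}
    for idx, digest in enumerate(digests):
        if idx >= len(prev_digests) or prev_digests[idx] != digest:
            start = idx * block_size
            delta[idx] = state[start:start + block_size]
    return delta, digests


def restore(full: bytes, deltas: list, block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the latest state from a full checkpoint plus ordered deltas.

    Assumes the state size never changes between checkpoints.
    """
    buf = bytearray(full)
    for delta in deltas:
        for idx, block in delta.items():
            start = idx * block_size
            buf[start:start + len(block)] = block
    return bytes(buf)


if __name__ == "__main__":
    state0 = bytes(4 * BLOCK_SIZE)          # four zero-filled blocks
    full, digests0 = incremental_checkpoint(state0, [])
    changed = bytearray(state0)
    changed[5000] = 1                       # touch one byte in block 1
    delta1, _ = incremental_checkpoint(bytes(changed), digests0)
    print(sorted(delta1))                   # indices of blocks saved
    assert restore(state0, [delta1]) == bytes(changed)
```

Hashing every block on each checkpoint is itself a cost; it stands in here for whatever change-tracking bookkeeping a real system-level checkpointer would use.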

Cited by 17 publications

References 6 publications
