Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems

Meneses, Esteban; Sarood, Osman; Kalé, Laxmikant V.

doi:10.1109/sbac-pad.2012.12

Cited by 28 publications

(32 citation statements)

References 10 publications


“…This paper extends the material presented in our previous publication [7] by refining the analytical formulation to model the energy consumption of the different fault-tolerance protocols, extending the experimental results on new and more accurate power-measuring hardware, and improving the projections to extreme scale systems. The contributions of this paper are the following:…”

Section: Introduction (mentioning)

confidence: 66%

“…The checkpoint and restart time are based on the algorithm described in Section 2 and the match expectations at large scale [1]. The parameters for message logging and parallel restart are based on empirical evidence we have collected [5,7,15]. Finally, the power levels H and L are based on the experimental results of Section 4.…”

Section: Extreme-scale Projections (mentioning)

confidence: 99%


Energy profile of rollback-recovery strategies in high performance computing

Meneses

Sarood

Kalé

2014

Parallel Computing

Self Cite


Extreme-scale computing is set to provide the infrastructure for the advances and breakthroughs that will solve some of the hardest problems in science and engineering. However, resilience and energy concerns loom as two of the major challenges for machines at that scale. The number of components that will be assembled in the supercomputers plays a fundamental role in these challenges. First, a large number of parts will substantially increase the failure rate of the system compared to the failure frequency of current machines. Second, those components have to fit within the power envelope of the installation and keep the energy consumption within operational margins. Extreme-scale machines will have to incorporate fault tolerance mechanisms and honor the energy and power restrictions. Therefore, it is essential to understand how fault tolerance and energy consumption interplay. This paper presents a comparative evaluation and analysis of energy consumption in three different rollback-recovery protocols: checkpoint/restart, message logging and parallel recovery. Our experimental evaluation shows parallel recovery has the minimum execution time and energy consumption. Additionally, we present an analytical model that projects parallel recovery can reduce energy consumption more than 37% compared to checkpoint/restart at extreme scale.


“…It then requires only a local rollback and saves energy by having the rest of the system idle or making progress on their own [23]. It may save time too, because messages have no delay or contention during recovery.…”

Section: Message Logging (mentioning)

confidence: 99%

“…If only the crashed PE is required to roll back and restart, important energy savings can be obtained [23]. However, message logging needs some metadata to be managed.…”

Section: Message Logging (mentioning)

confidence: 99%

Using Migratable Objects to Enhance Fault Tolerance Schemes in Supercomputers

Meneses

Zheng

et al. 2015

IEEE Trans. Parallel Distrib. Syst.

Self Cite


Supercomputers have seen an exponential increase in their size in the last two decades. Such a high growth rate is expected to take us to exascale in the timeframe 2018-2022. But, to bring a productive exascale environment about, it is necessary to focus on several key challenges. One of those challenges is fault tolerance. Machines at extreme scale will experience frequent failures and will require the system to avoid or overcome those failures. Various techniques have recently been developed to tolerate failures. The impact of these techniques and their scalability can be substantially enhanced by a parallel programming model called migratable objects. In this paper, we demonstrate how the migratable-objects model facilitates and improves several fault tolerance approaches. Our experimental results on thousands of cores suggest fault tolerance schemes based on migratable objects have low performance overhead and high scalability. Additionally, we present a performance model that predicts a significant benefit of using migratable objects to provide fault tolerance at extreme scale.


“…A technique called parallel recovery [36] leverages message-logging by distributing the tasks on the failed node to be recovered in parallel on other nodes of the system. This mechanism has been demonstrated to tolerate a higher failure rate [37]. More recently, replication of tasks has been proposed to deal with high failure rates [38].…”

Section: Related Work (mentioning)

confidence: 99%

A 'cool' way of improving the reliability of HPC machines

Sarood

Meneses

Kalé

2013

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Self Cite


Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving reliability of HPC machines. Fault rates are known to double for every 10 °C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving reliability of parallel machines and reducing total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12%. In addition, our scheme can also reduce machine energy consumption by as much as 25%. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.
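The doubling rule quoted in this abstract can be written as a simple exponential, which makes the arithmetic easier to follow. The sketch below is only an illustration of that rule, not the authors' model: the function names, the `doubling_step_c` parameter, and the assumption that the rule is the sole factor relating temperature to reliability are all hypothetical.

```python
import math


def fault_rate_multiplier(delta_temp_c: float, doubling_step_c: float = 10.0) -> float:
    """Factor by which the fault rate changes for a core temperature change.

    Assumes the rate doubles for every `doubling_step_c` degrees of rise,
    so a negative delta (cooling) gives a multiplier below 1.
    """
    return 2.0 ** (delta_temp_c / doubling_step_c)


def cooling_for_reliability_gain(gain: float, doubling_step_c: float = 10.0) -> float:
    """Temperature drop (in °C) that would yield `gain` times fewer faults.

    Illustrative inversion of the doubling rule only; it ignores every
    other effect (load balancing, workload, hardware differences).
    """
    return doubling_step_c * math.log2(gain)


if __name__ == "__main__":
    print(fault_rate_multiplier(10.0))   # a 10 °C rise doubles the rate: 2.0
    print(fault_rate_multiplier(-10.0))  # a 10 °C drop halves the rate: 0.5
    print(cooling_for_reliability_gain(2.3))
```

Under this rule alone, a reliability gain of about 2.3 would correspond to a core temperature reduction of roughly 12 °C; the abstract does not state the temperature change actually achieved, so this figure is only a consequence of the assumed formula.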
