2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing 2012
DOI: 10.1109/sbac-pad.2012.12

Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems

Abstract: An exascale machine is expected to be delivered in the time frame 2018-2020. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of Nature and the universe. However, to make that a reality, the HPC community has to solve a few important challenges. Resilience will become a prominent problem because an exascale machine will experience frequent failures due to the large amount of components it will encompass. Some form of fault tolerance has t…

Cited by 28 publications (32 citation statements)
References 10 publications
“…This paper extends the material presented in our previous publication [7] by refining the analytical formulation to model the energy consumption of the different fault-tolerance protocols, extending the experimental results on new and more accurate power-measuring hardware, and improving the projections to extreme scale systems. The contributions of this paper are the following:…”
Section: Introduction
Citation type: mentioning
confidence: 66%
“…The checkpoint and restart time are based on the algorithm described in Section 2 and they match expectations at large scale [1]. The parameters for message logging and parallel restart are based on empirical evidence we have collected [5,7,15]. Finally, the power levels H and L are based on the experimental results of Section 4.…”
Section: Extreme-scale Projections
Citation type: mentioning
confidence: 99%
“…It then requires only a local rollback and saves energy by having the rest of the system idle or making progress on their own [23]. It may save time too, because messages have no delay or contention during recovery.…”
Section: Message Logging
Citation type: mentioning
confidence: 99%
“…If only the crashed PE is required to roll back and restart, important energy savings can be obtained [23]. However, message logging needs some metadata to be managed.…”
Section: Message Logging
Citation type: mentioning
confidence: 99%
“…A technique called parallel recovery [36] leverages message-logging by distributing the tasks on the failed node to be recovered in parallel on other nodes of the system. This mechanism has been demonstrated to tolerate a higher failure rate [37]. More recently, replication of tasks has been proposed to deal with high failure rates [38].…”
Section: Related Work
Citation type: mentioning
confidence: 99%