2014
DOI: 10.1016/j.parco.2014.03.005

Energy profile of rollback-recovery strategies in high performance computing

Abstract: Extreme-scale computing is set to provide the infrastructure for the advances and breakthroughs that will solve some of the hardest problems in science and engineering. However, resilience and energy concerns loom as two of the major challenges for machines at that scale. The number of components that will be assembled in the supercomputers plays a fundamental role in these challenges. First, a large number of parts will substantially increase the failure rate of the system compared to the failure frequency of…

citations
Cited by 21 publications
(22 citation statements)
references
References 19 publications
(23 reference statements)
“…That leads to a huge waste of time and energy [8], [9]. An enhancement to checkpoint/restart is message logging [15], a technique that stores checkpoints and, in principle, stores all the messages in an execution.…”
Section: Message Logging
Citation type: mentioning
confidence: 99%
“…The performance overhead of those strategies can be kept low [11], [12], they feature a very efficient energy profile [8], [9], and they make possible to parallelize recovery [10]. To leverage all those features, it is imperative to address the major drawback of message logging, namely its increase in memory footprint.…”
Section: Message-logging Protocol
Citation type: mentioning
confidence: 99%
“…Some teams [76] developed models for expected run time and energy consumption for global recovery, message logging, and parallel recovery protocols. These models show in an exascale scenario that parallel recovery outperforms coordinated checkpointing protocols since parallel recovery reduces the rework time.…”
Section: Energy Consumption
Citation type: mentioning
confidence: 99%