2013 IEEE 27th International Symposium on Parallel and Distributed Processing 2013
DOI: 10.1109/ipdps.2013.74

Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing

Cited by 42 publications
(32 citation statements)
references
References 28 publications
“…Other advances concern the combination of application-level checkpointing and failure prediction [10]. An important question is how to run the failure predictor on large infrastructures.…”
Section: Failure Prediction (mentioning)
confidence: 99%
“…This approach faces two difficulties: (1) local failure prediction will impose an overhead on the application running on the node, and (2) local failure prediction is less accurate than global failure prediction because the failure predictors have only a local view. These two difficulties are explained in [10]. Another important question is how to compute the optimal interval of preventive checkpoints when a proportion of the failures are predicted [10].…”
Section: Failure Prediction (mentioning)
confidence: 99%
“…If multiple applications run concurrently, a data-aware compression scheme [29] was proposed to improve the overall checkpointing efficiency. Recent study [30] shows that combining failure detection and proactive checkpointing could improve 30% efficiency compared to classical periodical checkpointing. Thus data compression has the potential to be combined with failure detection and proactive checkpointing to further improve the system efficiency.…”
Section: Related Work (mentioning)
confidence: 99%
“…Event analysis and classification in large-scale system have been the subject of many studies, and research has shown that they can lead to good prediction results [8], [9]. One of the major challenges in this endeavor is the impact of false positives.…”
Section: B Introspective Systems (mentioning)
confidence: 99%