Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale 2015
DOI: 10.1145/2751504.2751509

How Much SSD Is Useful for Resilience in Supercomputers

Abstract: We consider the use of non-volatile memories in the form of burst buffers for resilience in supercomputers. Their cost and limited lifetime demand effective use and appropriate provisioning. We develop an analytic model for the behavior of workloads on systems with burst buffers, and use it to explore questions of cost-effective provisioning and mission-directed allocation of burst-buffer (SSD) lifetime. First, our results show that system efficiency can be increased by as much as 14% by considering a global pe…

Cited by 7 publications (5 citation statements)
References 27 publications
“…In order to better utilize SSD devices under scientific I/O workloads, [14] presents a checkpoint interval optimization model for large-scale scientific applications. This model is essentially same as those developed by [26,27,28] whose objective is to maximize the computational efficiency of HPC systems, but it also takes the constraint of burst buffer capacity into consideration.…”
Section: Related Work (mentioning)
confidence: 99%
“…When model AUBNP is used, all the checkpoint data is written to the burst buffer without limit while no checkpoint is written to the PFS directly, and the checkpoint interval is determined through adaptive algorithm named "lazy checkpoint" (see section 3.2). When model SLBNP is used, limited checkpoint data is written to the burst buffer while no checkpoint is written to the PFS directly, and the checkpoint interval is determined through the optimization model proposed in [14] (see section 3.3). SLBUP and ALBUP are two models proposed in this paper, which all limit the checkpoint data written to the burst buffer while leverage the PFS to keep the checkpoint frequency from decreasing too much (see section 4 and 5 for more details).…”
Section: Evaluation Setup (mentioning)
confidence: 99%
“…A wide array of checkpoint-restart research has explored techniques to efficiently apply checkpointing (Daly, 2006; Young, 1974) and improve its performance (Antypas et al., 2014; Bautista-Gomez et al., 2011; Cappello et al., 2011; Fang and Chien, 2015; Moody et al., 2010). Notably, recent advances that exploit high bandwidth non-volatile memories both reduce checkpoint cost dramatically, and because their efficiency reduces the optimal checkpoint interval, can do "micro-checkpointing" (fast checkpointing with interval of seconds), reducing the work lost per detected error or process failure.…”
Section: Introduction (mentioning)
confidence: 99%
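
The last statement notes that cheaper checkpoints shorten the optimal checkpoint interval. A minimal sketch of why, using Young's (1974) first-order approximation, interval = sqrt(2 × checkpoint cost × mean time between failures), rather than the analytic model of the paper listed here. The function name and all numeric values are hypothetical, chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the optimal checkpoint interval (seconds).

    checkpoint_cost_s: time to write one checkpoint.
    mtbf_s: mean time between failures of the system.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical values, not taken from any of the cited papers.
mtbf_s = 24 * 3600.0   # assume one failure per day
pfs_cost_s = 600.0     # assumed checkpoint cost to a parallel file system
bb_cost_s = 6.0        # assumed checkpoint cost to a burst buffer (100x cheaper)

pfs_interval = young_interval(pfs_cost_s, mtbf_s)
bb_interval = young_interval(bb_cost_s, mtbf_s)
ratio = pfs_interval / bb_interval

print(f"PFS interval: {pfs_interval:.0f} s")
print(f"Burst-buffer interval: {bb_interval:.0f} s")
print(f"Interval shrinks by a factor of {ratio:.1f}")
```

Because the interval scales with the square root of the checkpoint cost, a 100-fold cheaper checkpoint shortens the interval only 10-fold under this approximation. The approximation also assumes the checkpoint cost is small relative to the mean time between failures.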