How Much SSD Is Useful for Resilience in Supercomputers

Fang, Aiman; Chien, Andrew A.

doi:10.1145/2751504.2751509

Cited by 7 publications

(5 citation statements)

References 27 publications

“…In order to better utilize SSD devices under scientific I/O workloads, [14] presents a checkpoint interval optimization model for large-scale scientific applications. This model is essentially same as those developed by [26,27,28] whose objective is to maximize the computational efficiency of HPC systems, but it also takes the constraint of burst buffer capacity into consideration.…”

Section: Related Work (mentioning)

confidence: 99%

“…When model AUBNP is used, all the checkpoint data is written to the burst buffer without limit while no checkpoint is written to the PFS directly, and the checkpoint interval is determined through adaptive algorithm named "lazy checkpoint" (see section 3.2). When model SLBNP is used, limited checkpoint data is written to the burst buffer while no checkpoint is written to the PFS directly, and the checkpoint interval is determined through the optimization model proposed in [14] (see section 3.3). SLBUP and ALBUP are two models proposed in this paper, which all limit the checkpoint data written to the burst buffer while leverage the PFS to keep the checkpoint frequency from decreasing too much (see section 4 and 5 for more details).…”

Section: Evaluation Setup (mentioning)

confidence: 99%

“…Besides frequently replacing worn-out SSD devices, another possible solution would be reducing the amount of data written to the burst buffer. In [14], the authors proposed a checkpoint interval optimization model for large-scale scientific applications which takes the constraint of burst buffer capacity into consideration. In such model, SSD-based burst buffers of supercomputers are used to absorb all checkpoint data of the scientific applications.…”

Section: Introduction (mentioning)

confidence: 99%

“…We use $\Delta t^{wr}_{ckpt,i}$ to denote the time spent on writing one checkpoint of the i-th job to the storage system. [26,27,28,14]. If the failure rate per compute node in the HPC system is $\lambda$ and the number of compute nodes occupied by the i-th job is $n_i$, the total execution time of the i-th job can be denoted as:…”

Section: Introduction (mentioning)

confidence: 99%

Optimizing checkpoint data placement with guaranteed burst buffer endurance in large-scale hierarchical storage systems

Wan, Cao, Wang, et al. 2017

Journal of Parallel and Distributed Computing

Non-volatile devices, such as SSDs, will be an integral part of the deepening storage hierarchy on large-scale HPC systems. These devices can be on the compute nodes as part of a distributed burst buffer service or they can be external. Wherever they are located in the hierarchy, one critical design issue is the SSD endurance under the write-heavy workloads, such as the checkpoint I/O for scientific applications. For these environments, it is widely assumed that checkpoint operations can occur once every 60 minutes and for each checkpoint step as much as half of the system memory can be written out. Unfortunately, for large-scale HPC applications, the burst buffer SSDs can be worn out much more quickly given the extensive amount of data written at every checkpoint step. One possible solution is to control the amount of data written by reducing the checkpoint frequency. However, a direct effect caused by reduced checkpoint frequency is increased vulnerability window of system failures and therefore potentially wasted computation time, especially for large-scale compute jobs. In this paper, we propose a new checkpoint placement optimization model which collaboratively utilizes both the burst buffer and the parallel file system to store the checkpoints, with design goals of maximizing computation efficiency while guaranteeing the SSD endurance requirements. Moreover, we present
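The endurance concern described in this abstract can be made concrete with a back-of-the-envelope calculation. The sketch below is not taken from the paper: the function name, the SSD capacity, the rated program/erase cycle count, and the per-node memory size are all hypothetical values chosen for illustration. Only the once-per-60-minutes checkpoint interval and the half-of-memory checkpoint size come from the abstract.

```python
def ssd_lifetime_years(
    ssd_capacity_gb: float,
    rated_pe_cycles: float,
    memory_gb: float,
    ckpt_fraction: float = 0.5,        # abstract: up to half of memory per checkpoint
    interval_min: float = 60.0,        # abstract: one checkpoint every 60 minutes
    write_amplification: float = 1.0,  # assumption: ideal sequential writes
) -> float:
    """Rough SSD lifetime under checkpoint-only writes (illustrative model)."""
    if min(ssd_capacity_gb, rated_pe_cycles, memory_gb, interval_min) <= 0:
        raise ValueError("capacities, cycles, and interval must be positive")
    # Total data the device can absorb before reaching its rated wear limit.
    total_writable_gb = ssd_capacity_gb * rated_pe_cycles / write_amplification
    # Data written per day by periodic checkpoints.
    checkpoints_per_day = 24 * 60 / interval_min
    written_per_day_gb = memory_gb * ckpt_fraction * checkpoints_per_day
    return total_writable_gb / written_per_day_gb / 365.0


# Hypothetical node: 128 GB memory, 512 GB SSD rated for 3000 P/E cycles.
hourly = ssd_lifetime_years(512, 3000, 128)
two_hourly = ssd_lifetime_years(512, 3000, 128, interval_min=120.0)
print(f"hourly checkpoints:     {hourly:.2f} years")
print(f"two-hourly checkpoints: {two_hourly:.2f} years")
```

Under this simplified model, halving the checkpoint frequency doubles the device lifetime, which is the trade-off against lost computation that the abstract describes.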

“…A wide array of checkpoint-restart research has explored techniques to efficiently apply checkpointing (Daly, 2006; Young, 1974) and improve its performance (Antypas et al., 2014; Bautista-Gomez et al., 2011; Cappello et al., 2011; Fang and Chien, 2015; Moody et al., 2010). Notably, recent advances that exploit high bandwidth non-volatile memories both reduce checkpoint cost dramatically, and because their efficiency reduces the optimal checkpoint interval, can do "micro-checkpointing" (fast checkpointing with interval of seconds), reducing the work lost per detected error or process failure.…”

Section: Introduction (mentioning)

confidence: 99%
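The statement above notes that cheaper checkpoints shorten the optimal checkpoint interval. A minimal sketch of that relationship, using Young's (1974) first-order approximation (interval = sqrt(2 × checkpoint cost × mean time between failures)), follows. The checkpoint costs and the one-day MTBF are hypothetical values for illustration and do not come from any of the cited papers.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * M)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


MTBF_S = 24 * 3600.0  # hypothetical system MTBF: one day

# Hypothetical checkpoint costs for two storage targets.
for label, cost_s in [("parallel file system", 600.0), ("non-volatile memory", 6.0)]:
    tau = young_interval(cost_s, MTBF_S)
    print(f"{label}: cost {cost_s:.0f} s -> interval {tau / 60:.1f} min")
```

Because the interval scales with the square root of the checkpoint cost, a 100x cheaper checkpoint shortens the interval by 10x in this approximation.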

Exploring versioned distributed arrays for resilience in scientific applications

Chien, Balaji, Dun, et al. 2016

The International Journal of High Performance Computing Applica

Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and when it is most productive, and customize for each application structure independently. This control is portable, and its embedding in application source makes it natural to express and easy to maintain. The ability to name multiple versions and "partially materialize" them efficiently makes ambitious forward-recovery based on "data slices" across versions or data structures both easy to express and efficient. Using several large applications (OpenMC, preconditioned conjugate gradient (PCG) solver, ddcMD, and Chombo), we evaluate the programming effort to add resilience. The required changes are small (< 2% lines of code (LOC)), localized and machine-independent, and perhaps most important, require no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads < 2% are generally achieved. This overhead suggests that GVR can be implemented in large-scale codes and support portable error recovery with modest investment and runtime impact. Our results are drawn from both IBM BG/Q and Cray XC30 experiments, demonstrating portability. We also present two case studies of flexible error recovery, illustrating how GVR can be used for multi-version rollback recovery, and several different forward-recovery schemes. GVR's multi-version enables applications to survive latent errors (silent data corruption) with significant detection latency, and forward recovery can make that recovery extremely efficient. Our results suggest that GVR is scalable, portable, and efficient. GVR interfaces are flexible, supporting a variety of recovery schemes, and altogether GVR embodies a gentle-slope path to tolerate growing error rates in future extreme-scale systems.
