Aiman Fang scite author profile

2015

We consider the use of non-volatile memories in the form of burst buffers for resilience in supercomputers. Their cost and limited lifetime demand effective use and appropriate provisioning. We develop an analytic model for the behavior of workloads on systems with burst buffers, and use it to explore questions of cost-effective provisioning, and mission-directed allocation of burst-buffer (SSD) lifetime. First, our results show that system efficiency can be increased by as much as 14% by considering a global perspective (workload mix, job size) for SSD lifetime allocation. Second, with size-based and system-efficiency-based lifetime allocation, large jobs suffer as much as 40% job efficiency loss; job-efficiency-based allocation must increase their allocations by 50% to eliminate this disparity. Finally, further results suggest that underprovisioning SSD lifetime (only 10-20% of the "optimum" as defined by per-job requirements without resource constraint) is sufficient to produce 90% system efficiency at failure rates three times that of current systems.

Towards Understanding Post-recovery Efficiency for Shrinking and Non-shrinking Recovery

Fujita

2015

Exploring versioned distributed arrays for resilience in scientific applications

The International Journal of High Performance Computing Applica

Balaji

Dun

et al. 2016

Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and when it is most productive, and customize for each application structure independently. This control is portable, and its embedding in application source makes it natural to express and easy to maintain. The ability to name multiple versions and "partially materialize" them efficiently makes ambitious forward-recovery based on "data slices" across versions or data structures both easy to express and efficient. Using several large applications (OpenMC, preconditioned conjugate gradient (PCG) solver, ddcMD, and Chombo), we evaluate the programming effort to add resilience. The required changes are small (< 2% lines of code (LOC)), localized and machine-independent, and perhaps most important, require no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads < 2% are generally achieved. This overhead suggests that GVR can be implemented in large-scale codes and support portable error recovery with modest investment and runtime impact. Our results are drawn from both IBM BG/Q and Cray XC30 experiments, demonstrating portability. We also present two case studies of flexible error recovery, illustrating how GVR can be used for multi-version rollback recovery, and several different forward-recovery schemes. GVR's multi-version enables applications to survive latent errors (silent data corruption) with significant detection latency, and forward recovery can make that recovery extremely efficient. Our results suggest that GVR is scalable, portable, and efficient. GVR interfaces are flexible, supporting a variety of recovery schemes, and altogether GVR embodies a gentle-slope path to tolerate growing error rates in future extreme-scale systems.

Multi-versioning Performance Opportunities in BGAS System for Resilience

Dun

Pleiter

et al. 2016

WITHDRAWN: Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience

Journal of Computational Science

Balaji

Beckman

et al. 2015

Flexible Error Recovery Using Versions in Global View Resilience

Dun

Fujita

et al. 2015