Hybrid Checkpointing for MPI Jobs in HPC Environments

Wang, Chao; Mueller, Frank; Engelmann, Christian; Scott, Stephen L.

doi:10.1109/icpads.2010.48

Cited by 44 publications

(26 citation statements)

References 25 publications

“…Today's Petascale systems use a combination of hardware, firmware, and system software techniques to hide many errors from applications, resulting in a mean time between failures or interruptions (MTBF/I) of 6.5-40 hours [1], [2]. Looking forward to Exascale, members of the community expect that both the sheer scale of components, and the move toward heterogeneous architectures, near-threshold computing, and aggressive power management will compound the resiliency challenge so that, with the current techniques, the time to handle system resilience may exceed the mean time to interrupt of top supercomputers before 2015 [3].…”

Section: Introduction (mentioning)

confidence: 99%

Classifying soft error vulnerabilities in extreme-Scale scientific applications using a binary instrumentation tool

Liu

Vetter

2012

2012 International Conference for High Performance Computing, Networking, Storage and Analysis

Extreme-scale scientific applications are at a significant risk of being hit by soft errors on supercomputers as the scale of these systems and the component density continue to increase. In order to better understand the specific soft error vulnerabilities in scientific applications, we have built an empirical fault injection and consequence analysis tool (BIFIT) that allows us to evaluate how soft errors impact applications. In particular, BIFIT is designed with the capability to inject faults at very specific targets: an arbitrarily chosen execution point and any specific data structure. We apply BIFIT to three mission-critical scientific applications and investigate the applications' vulnerability to soft errors by performing thousands of statistical tests. We then classify each application's individual data structures based on their sensitivity to these vulnerabilities, and generalize these classifications across applications. Subsequently, these classifications can be used to apply appropriate resiliency solutions to each data structure within an application. Our study reveals that these scientific applications have a wide range of sensitivities to both the time and the location of a soft error; yet, we are able to identify intrinsic relationships between application vulnerabilities and specific types of data objects. In this regard, BIFIT enables new opportunities for future resiliency research.

“…In order to optimize the checkpointing process, many approaches introduce optimizations that decompose the checkpoints into smaller, inter-dependent pieces [48,39]. This is done in order to speed up the checkpointing performance, at the expense of having to reconstruct the checkpoint at restart time.…”

Section: Desired Features of CR (mentioning)

confidence: 99%

“…However, unlike our approach, differences to previous checkpoints are stored as separate files, which raises manageability issues. Approaches such as [48] attempt to compensate for this effect using a hybrid CR mechanism that relies on incremental checkpoints to complement full checkpoints, with the purpose of avoiding indefinite accumulation of differences. Our approach avoids this problem altogether, thanks to shadowing.…”

Section: Related Work (mentioning)

confidence: 99%
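
The hybrid scheme described in the two statements above (full checkpoints complemented by incremental ones, with the restart path reconstructing state from the pieces) can be illustrated with a minimal sketch. This is not the implementation from the cited paper: the class name `HybridCheckpointer`, the `full_every` parameter, and the page-dictionary state model are all hypothetical, and the sketch assumes pages are only modified, never removed.

```python
from typing import Dict, List, Tuple

# Hypothetical state model: page number -> page contents.
State = Dict[int, bytes]


class HybridCheckpointer:
    """Illustrative only: a full checkpoint every `full_every` calls,
    incremental (changed-pages-only) checkpoints in between."""

    def __init__(self, full_every: int = 4) -> None:
        self.full_every = full_every
        self.chain: List[Tuple[str, State]] = []  # one full + its increments
        self._last: State = {}
        self._count = 0

    def checkpoint(self, state: State) -> str:
        if self._count % self.full_every == 0:
            # A new full checkpoint discards the old chain, so the
            # differences cannot accumulate indefinitely.
            self.chain = [("full", dict(state))]
            kind = "full"
        else:
            # Store only the pages that changed since the last checkpoint.
            delta = {k: v for k, v in state.items() if self._last.get(k) != v}
            self.chain.append(("incr", delta))
            kind = "incr"
        self._last = dict(state)
        self._count += 1
        return kind

    def restore(self) -> State:
        # Restart cost: replay the increments on top of the full checkpoint.
        state: State = {}
        for _, pages in self.chain:
            state.update(pages)
        return state
```

The trade-off mentioned in the statements shows up directly: incremental checkpoints write less data, but `restore` has to walk the whole chain, and the periodic full checkpoint bounds the length of that chain.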

BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds

Nicolae

Cappello

2013

Journal of Parallel and Distributed Computing

Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running HPC applications. Given the need to provide fault tolerance, support for suspend-resume and offline migration, an efficient Checkpoint-Restart mechanism becomes paramount in this context. We propose BlobCR, a dedicated checkpoint repository that is able to take live incremental snapshots of the whole disk attached to the virtual machine (VM) instances. BlobCR aims to minimize the performance overhead of checkpointing by persisting VM disk snapshots asynchronously in the background using a low overhead technique we call selective copy-on-write. It includes support for both application-level and process-level checkpointing, as well as support to roll back file system changes. Experiments at large scale demonstrate the benefits of our proposal both in synthetic settings and for a real-life HPC application.

“…Some of the most desired features were (i) the possibility of pausing a long-running job for the benefit of a smaller but highly urgent job, (ii) a mechanism for dynamic resource allocation, namely reassigning nodes to already running jobs, (iii) adding nodes to a running calculation and (iv) a failover mechanism that enables a node to automatically rejoin calculations after solving/encountering a hardware problem. Research on middleware implementing these exact features as an industry standard is currently ongoing (Wang et al, 2008), but was not available in 2004. Another highly desired feature was to include the increasing computational power of standard workstations available locally in our calculations.…”

Section: The Smarttray (mentioning)

confidence: 99%

Parallel, distributed and GPU computing technologies in single-particle electron microscopy

Schmeißer

Heisen

Luettich

et al. 2009

Acta Crystallogr D Biol Cryst

Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today's technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined.
