2010 IEEE 16th International Conference on Parallel and Distributed Systems 2010
DOI: 10.1109/icpads.2010.48

Hybrid Checkpointing for MPI Jobs in HPC Environments

Cited by 44 publications (26 citation statements)
References 25 publications

“…Today's Petascale systems use a combination of hardware, firmware, and system software techniques to hide many errors from applications, resulting in a mean time between failures or interruptions (MTBF/I) of 6.5-40 hours [1], [2]. Looking forward to Exascale, members of the community expect that both the sheer scale of components, and the move toward heterogeneous architectures, near-threshold computing, and aggressive power management will compound the resiliency challenge so that, with the current techniques, the time to handle system resilience may exceed the mean time to interrupt of top supercomputers before 2015 [3].…”
Section: Introduction (mentioning)
confidence: 99%
“…In order to optimize the checkpointing process, many approaches introduce optimizations that decompose the checkpoints into smaller, inter-dependent pieces [48,39]. This is done in order to speed up the checkpointing performance, at the expense of having to reconstruct the checkpoint at restart time.…”
Section: Desired Features of CR (mentioning)
confidence: 99%
“…However, unlike our approach, differences to previous checkpoints are stored as separate files, which raises manageability issues. Approaches such as [48], attempt to compensate for this effect using a hybrid CR mechanism that relies on incremental checkpoints to complement full checkpoints, with the purpose of avoiding indefinite accumulation of differences. Our approach avoids this problem altogether, thanks to shadowing.…”
Section: Related Work (mentioning)
confidence: 99%
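
The hybrid idea described in the statement above (incremental checkpoints complementing periodic full ones, so that differences do not accumulate indefinitely) can be illustrated with a minimal Python sketch. This is not the cited paper's implementation: the class name `HybridCheckpointer`, the `full_every` parameter, and the model of process state as a dictionary of page numbers to bytes are all assumptions made for illustration, and the sketch assumes pages are only modified or added, never removed.

```python
from typing import Dict, List, Tuple

Pages = Dict[int, bytes]  # hypothetical model: page number -> page contents


class HybridCheckpointer:
    """Illustrative only: a full checkpoint every `full_every` calls,
    incremental (changed pages only) checkpoints in between."""

    def __init__(self, full_every: int = 4) -> None:
        if full_every < 1:
            raise ValueError("full_every must be >= 1")
        self.full_every = full_every
        self.chain: List[Tuple[str, Pages]] = []  # latest full + later deltas
        self._last: Pages = {}                    # state at previous checkpoint
        self._count = 0

    def checkpoint(self, state: Pages) -> str:
        if self._count % self.full_every == 0:
            # A full checkpoint discards the accumulated deltas.
            self.chain = [("full", dict(state))]
            kind = "full"
        else:
            # An incremental checkpoint keeps only pages that changed.
            delta = {k: v for k, v in state.items() if self._last.get(k) != v}
            self.chain.append(("incr", delta))
            kind = "incr"
        self._last = dict(state)
        self._count += 1
        return kind

    def restore(self) -> Pages:
        # Rebuild by replaying the deltas on top of the latest full checkpoint.
        state: Pages = {}
        for _, pages in self.chain:
            state.update(pages)
        return state
```

In this sketch the restart cost is bounded because the chain never holds more than `full_every - 1` incremental entries after a full checkpoint, which is the trade-off the statement attributes to hybrid schemes.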
“…Some of the most desired features were (i) the possibility of pausing a long-running job for the benefit of a smaller but highly urgent job, (ii) a mechanism for dynamic resource allocation, namely reassigning nodes to already running jobs, (iii) adding nodes to a running calculation and (iv) a failover mechanism that enables a node to automatically rejoin calculations after solving/encountering a hardware problem. Research on middleware implementing these exact features as an industry standard is currently ongoing (Wang et al, 2008), but was not available in 2004. Another highly desired feature was to include the increasing computational power of standard workstations available locally in our calculations.…”
Section: The Smarttray (mentioning)
confidence: 99%