2012 International Conference for High Performance Computing, Networking, Storage and Analysis 2012
DOI: 10.1109/sc.2012.46

Design and modeling of a non-blocking checkpointing system

Abstract: As the capability and component count of systems increase, the MTBF decreases. Typically, applications tolerate failures with checkpoint/restart to a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising solution. However, while multi-level checkpointing is successful on today's machines, it is not expected to be sufficient for exascale class machines, which are predicted to have orders of magnitude larger memory …

Cited by 64 publications (62 citation statements)
References 13 publications (26 reference statements)
“…Third, checkpoints are copied from main memory to the non-volatile storage of the node. Since, a combination between multi-level and non-blocking checkpointing can benefit the performance of checkpointing [10], in our checkpointing architecture, FPGA does not wait until its all checkpoints are written to the non-volatile storage of the node, but resumes the normal operations immediately after the all checkpoints are written to Capture FIFO.…”
Section: Cpr Gate (mentioning)
confidence: 99%
“…TSUBAME2.0 FDH has 4 levels [38]: nodes, power supply units (PSUs), edge switches, and racks (h = 4) [38]. Then, to get P cf , we calculate distributions Pj(xj) that determine the probability of xj concurrent crashes at level j of the TSUBAME FDH.…”
Section: Analysis Of Protocol Resilience (mentioning)
confidence: 99%
“…Two popular resilience schemes used in today's computing environments are coordinated checkpointing (CC) and uncoordinated checkpointing augmented with message logging (UC) [17]. In CC applications regularly synchronize to save their state to memory, local disks, or parallel file system (PFS) [38]; this data is used to restart after a crash. In UC processes take checkpoints independently and use message logging to avoid rollbacks caused by the domino effect [37].…”
Section: Introduction (mentioning)
confidence: 99%
“…The checkpoint period can be defined in different ways. Checkpoints also can be moved between levels in various ways, for example, by using a dedicated thread [4] or agents running on additional nodes [87]). A new semi-blocking checkpoint protocol leverages multiple levels of checkpoint to decrease checkpoint time [80].…”
Section: Toward Exascale Resilience: 2014 Update (mentioning)
confidence: 99%