Design and modeling of a non-blocking checkpointing system

Sato, Kento; Moody, Adam; Mohror, Kathryn; Maruyama, Naoya; Gamblin, Todd; Supinski, Bronis R. de; Matsuoka, Satoshi

doi:10.1109/sc.2012.46

Cited by 64 publications

(62 citation statements)

References 13 publications

(26 reference statements)

“…Third, checkpoints are copied from main memory to the non-volatile storage of the node. Since, a combination between multi-level and non-blocking checkpointing can benefit the performance of checkpointing [10], in our checkpointing architecture, FPGA does not wait until its all checkpoints are written to the non-volatile storage of the node, but resumes the normal operations immediately after the all checkpoints are written to Capture FIFO.…”

Section: Cpr Gate (mentioning)

confidence: 99%

A Tree-Based Checkpointing Architecture for the Dependability of FPGA Computing

Takamaeda-Yamazaki, Nakada, et al. 2018

IEICE Trans. Inf. & Syst.

SUMMARY: Modern FPGAs have been integrated in computing systems as accelerators for long running applications. This integration puts more pressure on the fault tolerance of computing systems, and the requirement for dependability becomes essential. As in the case of CPU-based system, checkpoint/restart techniques are also expected to improve the dependability of FPGA-based computing. Three issues arise in this situation: how to checkpoint and restart FPGAs, how well this checkpoint/restart model works with the checkpoint/restart model of the whole computing system, and how to build the model by a software tool. In this paper, we first present a new checkpoint/restart architecture along with a checkpointing mechanism on FPGAs. We then propose a method to capture consistent snapshots of FPGA and the rest of the computing system. Third, we provide "fine-grained" management for checkpointing to reduce performance degradation. For the host CPU, we also provide a stack which includes API functions to manage checkpoint/restart procedures on FPGAs. Fourth, we present a Python-based tool to insert checkpointing infrastructure. Experimental results show that the checkpointing architecture causes less than 10% maximum clock frequency degradation, low checkpointing latencies, small memory footprints, and small increases in power consumption, while the LUT overhead varies from 17.98% (Dijkstra) to 160.67% (Matrix Multiplication).
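The citing statement for this entry describes a non-blocking design: normal operation resumes as soon as checkpoints reach a fast buffer, while the copy to non-volatile storage continues in the background. The following is a minimal Python sketch of that general idea only. The class name `NonBlockingCheckpointer`, the in-memory queue standing in for the fast buffer, and the dict standing in for non-volatile storage are all hypothetical; this is not the implementation from either paper.

```python
import queue
import threading


class NonBlockingCheckpointer:
    """Hypothetical sketch: stage checkpoints in memory, flush them in the background."""

    def __init__(self, slow_store):
        self._staged = queue.Queue()   # fast staging buffer (assumption)
        self._slow_store = slow_store  # stand-in for non-volatile storage (assumption)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def checkpoint(self, step, state):
        # Snapshot the state so later mutation by the caller cannot alter it,
        # then return immediately without waiting for the slow write.
        self._staged.put((step, dict(state)))

    def _drain(self):
        # Background loop: move staged checkpoints to the slow store.
        while True:
            item = self._staged.get()
            if item is None:  # sentinel: stop the worker
                self._staged.task_done()
                break
            step, snapshot = item
            self._slow_store[step] = snapshot
            self._staged.task_done()

    def close(self):
        # Block until every staged checkpoint has reached the slow store.
        self._staged.put(None)
        self._staged.join()
        self._worker.join()


def run_demo():
    store = {}
    ckpt = NonBlockingCheckpointer(store)
    state = {"x": 0}
    for step in range(3):
        state["x"] += 1               # "application work"
        ckpt.checkpoint(step, state)  # returns without waiting for the flush
    ckpt.close()
    return store
```

Only `close()` waits for the background writes; each `checkpoint()` call costs one in-memory copy, which is the property the citing statement attributes to the non-blocking approach.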

“…TSUBAME2.0 FDH has 4 levels [38]: nodes, power supply units (PSUs), edge switches, and racks (h = 4) [38]. Then, to get P_cf, we calculate distributions P_j(x_j) that determine the probability of x_j concurrent crashes at level j of the TSUBAME FDH.…”

Section: Analysis Of Protocol Resilience (mentioning)

confidence: 99%

“…Two popular resilience schemes used in today's computing environments are coordinated checkpointing (CC) and uncoordinated checkpointing augmented with message logging (UC) [17]. In CC applications regularly synchronize to save their state to memory, local disks, or parallel file system (PFS) [38]; this data is used to restart after a crash. In UC processes take checkpoints independently and use message logging to avoid rollbacks caused by the domino effect [37].…”

Section: Introduction (mentioning)

confidence: 99%

Fault tolerance for remote memory access programming models

Besta, Hoefler 2014

Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing

Remote Memory Access (RMA) is an emerging mechanism for programming high-performance computers and datacenters. However, little work exists on resilience schemes for RMA-based applications and systems. In this paper we analyze fault tolerance for RMA and show that it is fundamentally different from resilience mechanisms targeting the message passing (MP) model. We design a model for reasoning about fault tolerance for RMA, addressing both flat and hierarchical hardware. We use this model to construct several highly-scalable mechanisms that provide efficient low-overhead in-memory checkpointing, transparent logging of remote memory accesses, and a scheme for transparent recovery of failed processes. Our protocols take into account diminishing amounts of memory per core, one of the major features of future exascale machines. The implementation of our fault-tolerance scheme entails negligible additional overheads. Our reliability model shows that in-memory checkpointing and logging provide high resilience. This study enables highly-scalable resilience mechanisms for RMA and fills a research gap between fault tolerance and emerging RMA programming models.

“…The checkpoint period can be defined in different ways. Checkpoints also can be moved between levels in various ways, for example, by using a dedicated thread [4] or agents running on additional nodes [87]. A new semi-blocking checkpoint protocol leverages multiple levels of checkpoint to decrease checkpoint time [80].…”

Section: Toward Exascale Resilience: 2014 Update (mentioning)

confidence: 99%

Toward Exascale Resilience: 2014 update

Cappello, Geist, Gropp, et al. 2014

JSFI

Resilience is a major roadblock for HPC executions on future exascale systems. These systems will typically gather millions of CPU cores running up to a billion threads. Projections from current large systems and technology evolution predict errors will happen in exascale systems many times per day. These errors will propagate and generate various kinds of malfunctions, from simple process crashes to result corruptions. The past five years have seen extraordinary technical progress in many domains related to exascale resilience. Several technical options, initially considered inapplicable or unrealistic in the HPC context, have demonstrated surprising successes. Despite this progress, the exascale resilience problem is not solved, and the community is still facing the difficult challenge of ensuring that exascale applications complete and generate correct results while running on unstable systems. Since 2009, many workshops, studies, and reports have improved the definition of the resilience problem and provided refined recommendations. Some projections made during the previous decades and some priorities established from these projections need to be revised. This paper surveys what the community has learned in the past five years and summarizes the research problems still considered critical by the HPC community.
