2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID) 2020
DOI: 10.1109/ccgrid49817.2020.00-69

Checkpoint Restart Support for Heterogeneous HPC Applications

Abstract: As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increasing number of cores as well as the increased complexity of modern heterogeneous systems result in a substantial decrease of the expected mean time between failures. Among the different fault tolerance techniques, checkpoint/restart is widely adopted in supercomputing systems. Although many supercomputers in the TOP 500 list use GPUs, only a few checkpoint/restart mechanisms support GPUs. In this paper, we extend an …

Cited by 12 publications (11 citation statements)
References 30 publications
“…There are works such as VeloC [9], DMTCP [10], CPPC [11], and SCR [12] which focus on simplifying or automating performant checkpointing. Further, there is much investigation into improving support for checkpointing across heterogeneous devices: including FTI [13] and CRUM [14]. These libraries each present a unique set of pros and cons that make the ideal choice highly dependent on application and environment details.…”
Section: Related Work (mentioning)
confidence: 99%
“…Direct allocation: Uses the cudaMalloc API, and blocks the application's I/O operations until the reserved buffer on the device is fully allocated and mapped. While state-of-the-art HPC checkpoint-restore runtimes such as FTI [27] rely on checkpointing directly to the host buffer, frameworks such as PyTorch [3] initialize the device buffer for checkpointing and/or staging data using this approach. Therefore, we consider this approach as the baseline for comparison against our proposed approach.…”
Section: B Compared Approaches (mentioning)
confidence: 99%
“…Parasyris et al 23 extend the checkpoint/restart library FTI by enabling it to create application‐level checkpoints for CUDA applications. The extended FTI creates checksums for GPU memory to support differential checkpoints, thereby decreasing the amount of memory transfers necessary.…”
Section: Related Work (mentioning)
confidence: 99%