Checkpoint Restart Support for Heterogeneous HPC Applications

Parasyris, Konstantinos; Keller, Kai; Bautista-Gomez, Leonardo; Ünsal, Osman

doi:10.1109/ccgrid49817.2020.00-69

Cited by 12 publications

(11 citation statements)

References 30 publications

“…There are works such as VeloC [9], DMTCP [10], CPPC [11], and SCR [12] which focus on simplifying or automating performant checkpointing. Further, there is much investigation into improving support for checkpointing across heterogeneous devices: including FTI [13] and CRUM [14]. These libraries each present a unique set of pros and cons that make the ideal choice highly dependent on application and environment details.…”

Section: Related Work (mentioning)

confidence: 99%

Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach

Whitlock, Morales, Bosilca, et al. 2022

2022 IEEE International Conference on Cluster Computing (CLUSTER)

Integrating recent advancements in resilient algorithms and techniques into existing codes is a singular challenge in fault tolerance, in part due to the underlying complexity of implementing resilience in the first place, but also due to the difficulty introduced when integrating the functionality of a standalone new strategy with the preexisting resilience layers of an application. We propose that the answer is not to build integrated solutions for users, but runtimes designed to integrate into a larger comprehensive resilience system and thereby enable the necessary jump to multi-layered recovery. Our work designs, implements, and verifies one such comprehensive system of runtimes. Utilizing Fenix, a process resilience tool with integration into preexisting resilience systems as a design priority, we update Kokkos Resilience and the use pattern of VeloC to support application-level integration of resilience runtimes. Our work shows that designing integrable systems rather than integrated systems allows for user-designed optimization and upgrading of resilience techniques while maintaining the simplicity and performance of all-in-one resilience solutions. More application-specific choice in resilience strategies allows for better long-term flexibility, performance, and, importantly, simplicity.

“…Direct allocation: Uses the cudaMalloc API, and blocks the application's I/O operations until the reserved buffer on the device is fully allocated and mapped. While state-of-the-art HPC checkpoint-restore runtimes such as FTI [27] rely on checkpointing directly to the host buffer, frameworks such as PyTorch [3] initialize the device buffer for checkpointing and/or staging data using this approach. Therefore, we consider this approach as the baseline for comparison against our proposed approach.…”

Section: B Compared Approaches (mentioning)

confidence: 99%

Towards Efficient Cache Allocation for High-Frequency Checkpointing

Maurya, Nicolae, Rafique, et al. 2022

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)

“…Parasyris et al. [23] extend the checkpoint/restart library FTI by enabling it to create application-level checkpoints for CUDA applications. The extended FTI creates checksums for GPU memory to support differential checkpoints, thereby decreasing the amount of memory transfers necessary.…”

Section: Related Work (mentioning)

confidence: 99%

Cricket: A virtualization layer for distributed execution of CUDA applications with checkpoint/restart support

Eiling, Baude, Lankes, et al. 2021

Concurrency and Computation

In high-performance computing and cloud computing, the introduction of heterogeneous computing resources, such as GPU accelerators, has led to a dramatic increase in performance and efficiency. While the benefits of virtualization features in these environments are well researched, GPUs do not offer virtualization support that enables fine-grained control, increased flexibility, and fault tolerance. In this article, we present Cricket: a transparent and low-overhead solution to GPU virtualization that enables future research into other virtualization techniques, due to its open-source nature. Cricket supports remote execution and checkpoint/restart of CUDA applications. Both features enable the distribution of GPU tasks dynamically and flexibly across computing nodes and the multitenant usage of GPU resources, thereby improving flexibility and utilization for high-performance and cloud computing.
