Abstract: As we approach the era of exascale computing, fault tolerance is of growing importance. The increasing number of cores and the increased complexity of modern heterogeneous systems result in a substantial decrease in the expected mean time between failures. Among the different fault tolerance techniques, checkpoint/restart is widely adopted in supercomputing systems. Although many supercomputers in the TOP500 list use GPUs, only a few checkpoint/restart mechanisms support GPUs. In this paper, we extend an …
“…There are works such as VeloC [9], DMTCP [10], CPPC [11], and SCR [12] which focus on simplifying or automating performant checkpointing. Further, there is much investigation into improving support for checkpointing across heterogeneous devices: including FTI [13] and CRUM [14]. These libraries each present a unique set of pros and cons that make the ideal choice highly dependent on application and environment details.…”
Integrating recent advancements in resilient algorithms and techniques into existing codes is a singular challenge in fault tolerance, in part due to the underlying complexity of implementing resilience in the first place, but also due to the difficulty of integrating the functionality of a standalone new strategy with the preexisting resilience layers of an application. We propose that the answer is not to build integrated solutions for users, but runtimes designed to integrate into a larger comprehensive resilience system and thereby enable the necessary jump to multi-layered recovery. Our work designs, implements, and verifies one such comprehensive system of runtimes. Utilizing Fenix, a process resilience tool with integration into preexisting resilience systems as a design priority, we update Kokkos Resilience and the use pattern of VeloC to support application-level integration of resilience runtimes. Our work shows that designing integrable systems rather than integrated systems allows for user-designed optimization and upgrading of resilience techniques while maintaining the simplicity and performance of all-in-one resilience solutions. More application-specific choice in resilience strategies allows for better long-term flexibility, performance, and, importantly, simplicity.
“…Direct allocation: Uses the cudaMalloc API, and blocks the application's I/O operations until the reserved buffer on the device is fully allocated and mapped. While state-of-the-art HPC checkpoint-restore runtimes such as FTI [27] rely on checkpointing directly to the host buffer, frameworks such as PyTorch [3] initialize the device buffer for checkpointing and/or staging data using this approach. Therefore, we consider this approach as the baseline for comparison against our proposed approach.…”
“…Parasyris et al. [23] extend the checkpoint/restart library FTI by enabling it to create application‐level checkpoints for CUDA applications. The extended FTI creates checksums for GPU memory to support differential checkpoints, thereby decreasing the amount of memory transfers necessary.…”
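The checksum-based differential checkpointing mentioned in this statement can be illustrated with a minimal sketch. It is not FTI's API or implementation: it works on plain host-side `bytes` rather than GPU memory, and the function names, the tiny block size, the equal-length state assumption, and the choice of SHA-256 are all hypothetical, picked only to show why per-block checksums reduce the data that has to be transferred.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny for illustration; real systems use much larger blocks


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def differential_checkpoint(data, previous_hashes, block_size=BLOCK_SIZE):
    """Return (changed_blocks, new_hashes).

    `changed_blocks` maps block index -> block bytes for every block whose
    digest differs from the previous checkpoint (or that is new).
    """
    new_hashes = block_hashes(data, block_size)
    changed = {}
    for index, digest in enumerate(new_hashes):
        if index >= len(previous_hashes) or previous_hashes[index] != digest:
            start = index * block_size
            changed[index] = data[start:start + block_size]
    return changed, new_hashes


def apply_checkpoint(base, changed, block_size=BLOCK_SIZE):
    """Rebuild a state by overlaying the changed blocks on an earlier state.

    Assumes the state has the same length as `base` (no growth or shrinkage).
    """
    restored = bytearray(base)
    for index, block in changed.items():
        start = index * block_size
        restored[start:start + len(block)] = block
    return bytes(restored)


# First checkpoint: no previous hashes, so every block is stored.
state0 = b"aaaabbbbcccc"
full, hashes0 = differential_checkpoint(state0, [])

# Second checkpoint: only the middle block changed, so only it is stored.
state1 = b"aaaaXXXXcccc"
delta, hashes1 = differential_checkpoint(state1, hashes0)
restored = apply_checkpoint(state0, delta)
```

In this sketch the second checkpoint stores one block instead of three. The same idea applied to device memory means only blocks whose checksum changed need to be copied off the GPU.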
In high-performance computing and cloud computing, the introduction of heterogeneous computing resources, such as GPU accelerators, has led to a dramatic increase in performance and efficiency. While the benefits of virtualization features in these environments are well researched, GPUs do not offer virtualization support that enables fine-grained control, increased flexibility, and fault tolerance. In this article, we present Cricket: a transparent and low-overhead solution to GPU virtualization that, due to its open-source nature, enables future research into other virtualization techniques. Cricket supports remote execution and checkpoint/restart of CUDA applications. Both features enable the distribution of GPU tasks dynamically and flexibly across computing nodes and the multitenant usage of GPU resources, thereby improving flexibility and utilization for high-performance and cloud computing.