2018 IEEE International Conference on Cluster Computing (CLUSTER)
DOI: 10.1109/cluster.2018.00047

CRUM: Checkpoint-Restart Support for CUDA's Unified Memory

Abstract: Unified Virtual Memory (UVM) was recently introduced on recent NVIDIA GPUs. Through software and hardware support, UVM provides a coherent shared memory across the entire heterogeneous node, migrating data as appropriate. The older CUDA programming style is akin to older large-memory UNIX applications which used to directly load and unload memory segments. Newer CUDA programs have started taking advantage of UVM for the same reasons of superior programmability that UNIX applications long ago switched to assuming…

Cited by 33 publications (19 citation statements)
References 35 publications (55 reference statements)
“…Checkpointing GPU-enabled applications is difficult without a way to save the internal GPU state. Although some proxy-based approaches have been proposed [16], most actual implementations still rely on application-specific modifications [43], which are not applicable to our study. Moreover, using GPUs in Docker requires ad hoc solutions such as NVIDIA-Docker, which currently does not support checkpointing.…”
Section: Related Work (classified as mentioning)
confidence: 99%
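The statement above points at the core difficulty: a conventional CPU-side checkpointer cannot see device memory or driver state, so applications often save and restore GPU buffers by hand. As a rough sketch of that application-specific workaround (not the method of any cited tool; the helper names checkpoint_buffer and restore_buffer are invented here), the following CUDA C++ fragment copies one device buffer to the host before a checkpoint and re-creates it after a restart:

#include <cuda_runtime.h>
#include <vector>

// Hypothetical application-specific checkpoint of a single device buffer.
// Before the process image is saved, drain the GPU and copy the buffer into
// host memory that a CPU-side checkpointer can see.
static std::vector<char> checkpoint_buffer(const void* d_ptr, size_t bytes) {
    std::vector<char> host_copy(bytes);
    cudaDeviceSynchronize();  // make sure no kernel is still writing the buffer
    cudaMemcpy(host_copy.data(), d_ptr, bytes, cudaMemcpyDeviceToHost);
    return host_copy;
}

// After restart the old CUDA context and device pointers are gone, so the
// buffer must be re-allocated and refilled from the saved host copy.
static void* restore_buffer(const std::vector<char>& host_copy) {
    void* d_ptr = nullptr;
    cudaMalloc(&d_ptr, host_copy.size());
    cudaMemcpy(d_ptr, host_copy.data(), host_copy.size(), cudaMemcpyHostToDevice);
    return d_ptr;
}

Anything owned by the driver (streams, events, in-flight kernels) is not captured this way, which is what motivates the proxy-based designs cited above.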
“…The availability of effective and scalable checkpointing techniques for accelerators is thus essential for emerging exascale systems. Initial contributions [130, 149, 166] do not support features that are normally available in recent devices, such as NVIDIA Unified Virtual Addressing (UVA), as pointed out in Reference [74]. The decoupled CPU-GPU architecture poses additional technical challenges for effective application-wide checkpointing.…”
Section: 2.1 (classified as mentioning)
confidence: 99%
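For context on why support for unified memory is singled out: with cudaMallocManaged there is a single pointer valid on both the CPU and the GPU, pages migrate on demand, and there is no explicit cudaMemcpy for a checkpointer to interpose on. A minimal managed-memory example using standard CUDA calls (the kernel and variable names are only for illustration):

#include <cuda_runtime.h>
#include <cstdio>

// Each thread scales one element of a managed array.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    // One allocation, one pointer, usable from host and device alike;
    // the runtime migrates pages on demand instead of via explicit copies.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // touched on the CPU
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // touched on the GPU
    cudaDeviceSynchronize();
    printf("data[0] = %f\n", data[0]);               // read on the CPU, no cudaMemcpy
    cudaFree(data);
    return 0;
}

Because there is no host "master copy" and no explicit transfer to intercept, managed memory is harder for transparent checkpointing tools to handle than the classic cudaMalloc/cudaMemcpy style.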
“…Other works introduce some form of GPU checkpointing, like HiAL-Ckpt [181], HeteroCheckpoint [102], and cudaCR [139], taking an application-specific approach to provide GPU-side checkpointing. Last, the CRUM framework presented in Reference [74], which also relies on a proxy-based approach along with new shadow page synchronization mechanisms, directly addresses support for CUDA's unified virtual memory (UVM) available in the latest device generations, enabling fast asynchronous checkpointing for large-memory CUDA UVM applications and significantly reducing checkpointing overheads. While all the above contributions address GPU devices, FPGAs have emerged during recent years as an alternative for dedicated acceleration matching a few specific types of HPC workloads.…”
Section: 2.1 (classified as mentioning)
confidence: 99%
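The shadow-page idea mentioned above can be pictured as keeping host-side mirrors of the device-resident data and overlapping the slow persistence step with resumed computation. The sketch below is a simplified illustration of that overlap under assumed data structures (the Region list, the async_checkpoint helper, and the background writer thread are inventions of this sketch, not CRUM's actual mechanism):

#include <cuda_runtime.h>
#include <cstdio>
#include <thread>
#include <utility>
#include <vector>

// A managed (or device) memory region registered for checkpointing.
struct Region {
    void*  dev;
    size_t bytes;
};

// Copy every registered region into host "shadow" buffers while the device is
// quiescent, then persist the shadows on a background thread so disk I/O
// overlaps with the application's next compute phase.
static std::thread async_checkpoint(const std::vector<Region>& regions, const char* path) {
    std::vector<std::vector<char>> shadows;
    cudaDeviceSynchronize();                       // quiesce the GPU before copying
    for (const Region& r : regions) {
        shadows.emplace_back(r.bytes);
        cudaMemcpy(shadows.back().data(), r.dev, r.bytes, cudaMemcpyDefault);
    }
    return std::thread([shadows = std::move(shadows), path]() {
        FILE* f = fopen(path, "wb");               // slow part runs off the critical path
        if (!f) return;
        for (const auto& s : shadows) fwrite(s.data(), 1, s.size(), f);
        fclose(f);
    });
}

A caller would register every large allocation in the regions list and join the returned thread before taking the next checkpoint; the GPU is only stalled for the device-to-host copies, not for the disk write.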
“…Data transmission between the two GPUs in the kernel function is completed through unified memory (Landaverde et al., 2014) and the peer-to-peer (P2P) transfer mode between the two GPUs shown in Figure 13 (Garg et al., 2018). These technologies hide data transfers between the GPUs.…”
Section: GPU Algorithm and Multistrategy Optimization (classified as mentioning)
confidence: 99%
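The two transfer paths mentioned in this excerpt, unified memory and P2P, look roughly as follows with standard CUDA calls; GPU indices 0 and 1 and the buffer size are assumptions of this sketch:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) {
        printf("Direct P2P access is not available between GPU 0 and GPU 1\n");
        return 1;
    }

    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);  // flags argument must be 0
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

    const size_t bytes = 1 << 20;
    void *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(0); cudaMalloc(&buf0, bytes);
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);

    // Explicit P2P copy: data moves GPU-to-GPU over NVLink/PCIe without
    // being staged through host memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    // Alternatively, a cudaMallocManaged pointer is valid on both GPUs and the
    // unified memory system migrates pages to whichever device touches them.
    cudaFree(buf0);
    cudaFree(buf1);
    return 0;
}

Compiled with nvcc and run on a node with two peer-capable GPUs, this exercises the explicit P2P path; on a single-GPU node it simply reports that P2P is unavailable.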