2018 IEEE International Conference on Cluster Computing (CLUSTER)
DOI: 10.1109/cluster.2018.00047

CRUM: Checkpoint-Restart Support for CUDA's Unified Memory

Abstract: Unified Virtual Memory (UVM) was recently introduced on recent NVIDIA GPUs. Through software and hardware support, UVM provides a coherent shared memory across the entire heterogeneous node, migrating data as appropriate. The older CUDA programming style is akin to older large-memory UNIX applications which used to directly load and unload memory segments. Newer CUDA programs have started taking advantage of UVM for the same reasons of superior programmability that UNIX applications long ago switched to assuming…

Cited by 33 publications (19 citation statements)
References 35 publications (55 reference statements)
“…Checkpointing GPU-enabled applications is difficult without a way to save the internal GPU state. Although some proxy-based approaches have been proposed [16], most actual implementations still rely on application-specific modifications [43], which are not applicable to our study. Moreover, using GPUs in Docker requires ad hoc solutions such as NVIDIA-Docker, which currently does not support checkpointing.…”
Section: Related Work (classified as mentioning)
confidence: 99%
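The statement above points at the core difficulty: a conventional CPU-side checkpointer cannot see device memory or driver state, so applications often save and restore GPU buffers by hand. As a rough sketch of that application-specific workaround (not the method of any cited tool; the helper names checkpoint_buffer and restore_buffer are invented here), the following CUDA C++ fragment copies one device buffer to the host before a checkpoint and re-creates it after a restart:

#include <cuda_runtime.h>
#include <vector>

// Hypothetical application-specific checkpoint of a single device buffer.
// Before the process image is saved, drain the GPU and copy the buffer into
// host memory that a CPU-side checkpointer can see.
static std::vector<char> checkpoint_buffer(const void* d_ptr, size_t bytes) {
    std::vector<char> host_copy(bytes);
    cudaDeviceSynchronize();  // make sure no kernel is still writing the buffer
    cudaMemcpy(host_copy.data(), d_ptr, bytes, cudaMemcpyDeviceToHost);
    return host_copy;
}

// After restart the old CUDA context and device pointers are gone, so the
// buffer must be re-allocated and refilled from the saved host copy.
static void* restore_buffer(const std::vector<char>& host_copy) {
    void* d_ptr = nullptr;
    cudaMalloc(&d_ptr, host_copy.size());
    cudaMemcpy(d_ptr, host_copy.data(), host_copy.size(), cudaMemcpyHostToDevice);
    return d_ptr;
}

Anything owned by the driver (streams, events, in-flight kernels) is not captured this way, which is what motivates the proxy-based designs cited above.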
“…The availability of effective and scalable checkpointing techniques for accelerators is thus essential for emerging exascale systems. Initial contributions [130, 149, 166] do not support features that are normally available in recent devices, such as NVIDIA Unified Virtual Addressing (UVA), as pointed out in Reference [74]. The decoupled CPU-GPU architecture poses additional technical challenges for effective application-wide checkpointing.…”
Section: 2.1 (classified as mentioning)
confidence: 99%
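For context on why support for unified memory is singled out: with cudaMallocManaged there is a single pointer valid on both the CPU and the GPU, pages migrate on demand, and there is no explicit cudaMemcpy for a checkpointer to interpose on. A minimal managed-memory example using standard CUDA calls (the kernel and variable names are only for illustration):

#include <cuda_runtime.h>
#include <cstdio>

// Each thread scales one element of a managed array.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    // One allocation, one pointer, usable from host and device alike;
    // the runtime migrates pages on demand instead of via explicit copies.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // touched on the CPU
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // touched on the GPU
    cudaDeviceSynchronize();
    printf("data[0] = %f\n", data[0]);               // read on the CPU, no cudaMemcpy
    cudaFree(data);
    return 0;
}

Because there is no host "master copy" and no explicit transfer to intercept, managed memory is harder for transparent checkpointing tools to handle than the classic cudaMalloc/cudaMemcpy style.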
“…Other works introduce some form of GPU checkpointing, like HiAL-Ckpt [181], HeteroCheckpoint [102], and cudaCR [139], taking an application-specific approach to provide GPU-side checkpointing. Last, the CRUM framework presented in Reference [74], which also relies on a proxy-based approach along with new shadow page synchronization mechanisms, directly addresses support for CUDA's unified virtual memory (UVM) available in the latest device generations, enabling fast asynchronous checkpointing for large-memory CUDA UVM applications and significantly reducing checkpointing overheads. While all the above contributions address GPU devices, FPGAs have emerged during recent years as an alternative for dedicated acceleration matching a few specific types of HPC workloads.…”
Section: 2.1 (classified as mentioning)
confidence: 99%
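The shadow-page idea mentioned above can be pictured as keeping host-side mirrors of the device-resident data and overlapping the slow persistence step with resumed computation. The sketch below is a simplified illustration of that overlap under assumed data structures (the Region list, the async_checkpoint helper, and the background writer thread are inventions of this sketch, not CRUM's actual mechanism):

#include <cuda_runtime.h>
#include <cstdio>
#include <thread>
#include <utility>
#include <vector>

// A managed (or device) memory region registered for checkpointing.
struct Region {
    void*  dev;
    size_t bytes;
};

// Copy every registered region into host "shadow" buffers while the device is
// quiescent, then persist the shadows on a background thread so disk I/O
// overlaps with the application's next compute phase.
static std::thread async_checkpoint(const std::vector<Region>& regions, const char* path) {
    std::vector<std::vector<char>> shadows;
    cudaDeviceSynchronize();                       // quiesce the GPU before copying
    for (const Region& r : regions) {
        shadows.emplace_back(r.bytes);
        cudaMemcpy(shadows.back().data(), r.dev, r.bytes, cudaMemcpyDefault);
    }
    return std::thread([shadows = std::move(shadows), path]() {
        FILE* f = fopen(path, "wb");               // slow part runs off the critical path
        if (!f) return;
        for (const auto& s : shadows) fwrite(s.data(), 1, s.size(), f);
        fclose(f);
    });
}

A caller would register every large allocation in the regions list and join the returned thread before taking the next checkpoint; the GPU is only stalled for the device-to-host copies, not for the disk write.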
“…Data transmission between the two GPUs in the kernel function is completed through unified memory (Landaverde et al., 2014) and the peer-to-peer (P2P) transfer mode between the two GPUs shown in Figure 13 (Garg et al., 2018). These technologies hide data transfers between the GPUs.…”
Section: GPU Algorithm and Multistrategy Optimization (classified as mentioning)
confidence: 99%
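The two transfer paths mentioned in this excerpt, unified memory and P2P, look roughly as follows with standard CUDA calls; GPU indices 0 and 1 and the buffer size are assumptions of this sketch:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) {
        printf("Direct P2P access is not available between GPU 0 and GPU 1\n");
        return 1;
    }

    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);  // flags argument must be 0
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

    const size_t bytes = 1 << 20;
    void *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(0); cudaMalloc(&buf0, bytes);
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);

    // Explicit P2P copy: data moves GPU-to-GPU over NVLink/PCIe without
    // being staged through host memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    // Alternatively, a cudaMallocManaged pointer is valid on both GPUs and the
    // unified memory system migrates pages to whichever device touches them.
    cudaFree(buf0);
    cudaFree(buf1);
    return 0;
}

Compiled with nvcc and run on a node with two peer-capable GPUs, this exercises the explicit P2P path; on a single-GPU node it simply reports that P2P is unavailable.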