2013
DOI: 10.2991/ijndc.2013.1.4.2

A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States

Abstract: Checkpoint/restart has been an effective mechanism to achieve fault tolerance for many long-running scientific applications. The common approach is to save computation states in memory and secondary storage for execution resumption. However, as the GPU plays a much bigger role in high performance computing, there is no effective checkpoint/restart scheme yet due to the difficulty of the GPU computation state handling. This paper proposes an application-level checkpoint/restart scheme to save and restore GPU co…

Cited by 7 publications (1 citation statement)
References 27 publications
“…Despite this, fine-grained migrations in the GPU kernel have also been proposed several times. In this article, the design of Jiang et al 26 was summarized in detail, where the source code was divided into multiple parts according to the given checkpoint. Until the computation was finished, the kernel would not stop being launched circularly.…”
Section: Related Work (mentioning)
confidence: 99%
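
The pattern described in the citation statement above (a computation split into parts at checkpoints, with the kernel relaunched in a loop until the work is done) can be illustrated with a small host-side sketch. This is not the implementation of Jiang et al.; it is a minimal simulation written in Python rather than CUDA, and every name in it (`run_segment`, `save_checkpoint`, `load_checkpoint`, `run`, the `fail_after_launches` parameter, and the sum-of-squares workload) is hypothetical and chosen only for illustration. `run_segment` stands in for a single kernel launch, and the pickled dictionary stands in for the GPU state that a real scheme would copy back to the host before writing it to storage.

```python
import pickle
from pathlib import Path


def run_segment(state, steps):
    """Stand-in for one kernel launch: advance the work by at most `steps` iterations."""
    end = min(state["next_i"] + steps, state["n"])
    for i in range(state["next_i"], end):
        state["acc"] += i * i          # hypothetical workload: sum of squares
    state["next_i"] = end              # record where the next launch must resume
    return state


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the checkpoint."""
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(pickle.dumps(state))
    tmp.replace(path)                  # a crash mid-write leaves the old checkpoint intact


def load_checkpoint(path):
    """Read back the most recently saved state."""
    return pickle.loads(path.read_bytes())


def run(path, n, steps_per_launch, fail_after_launches=None):
    """Relaunch the segment in a loop until done, checkpointing after each launch."""
    if path.exists():
        state = load_checkpoint(path)  # restart: resume from the saved state
    else:
        state = {"n": n, "next_i": 0, "acc": 0}
    launches = 0
    while state["next_i"] < state["n"]:
        if fail_after_launches is not None and launches == fail_after_launches:
            raise RuntimeError("simulated fault")  # illustration only
        state = run_segment(state, steps_per_launch)
        save_checkpoint(path, state)
        launches += 1
    return state["acc"]
```

Calling `run` a second time with the same checkpoint path after a simulated fault resumes from the last saved segment instead of starting over. In a real CUDA program the state would also include device buffers, which is the part the abstract identifies as difficult.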