2012
DOI: 10.1016/j.jpdc.2011.10.009

Proactive process-level live migration and back migration in HPC environments

Abstract: As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when one's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during mu…

citations
Cited by 42 publications
(31 citation statements)
references
References 47 publications
“…Before 2009, considerable research focused on how to avoid failures and their effects if failures could be predicted. Researchers explored the design and benefits of actions such as proactive migration of checkpointing [47,74,95]. The prediction of failures itself, however, was still an open issue.…”
Section: Failure Prediction (mentioning)
confidence: 99%
“…This work implements the checkpoint/restart procedure by transferring the process image to a healthy spare node for the purpose of resuming the process. Wang et al [12] proposed a process-level live migration mechanism to support continued execution of MPI processes. This work is integrated into an MPI execution environment to transparently sustain health-inflated node failures, which eradicates the need to restart and requeue MPI jobs.…”
Section: Related Work (mentioning)
confidence: 99%
“…Recent studies [21], [22], [23], [24], [25] apply execution migration techniques to enhance resource management by avoiding possible failures. They demonstrate the feasibility of exploiting proactive management methods for dependability assurance in networked computer systems.…”
Section: Related Work (mentioning)
confidence: 99%