Proactive process-level live migration and back migration in HPC environments

Wang, Chao; Mueller, Frank; Engelmann, Christian; Scott, Stephen L.

doi:10.1016/j.jpdc.2011.10.009

Cited by 42 publications

(31 citation statements)

References 47 publications

“…Before 2009, considerable research focused on how to avoid failures and their effects if failures could be predicted. Researchers explored the design and benefits of actions such as proactive migration of checkpointing [47,74,95]. The prediction of failures itself, however, was still an open issue.…”

Section: Failure Prediction (mentioning)

confidence: 99%

Toward Exascale Resilience: 2014 update

Cappello

Geist

Gropp

et al. 2014

JSFI

Resilience is a major roadblock for HPC executions on future exascale systems. These systems will typically gather millions of CPU cores running up to a billion threads. Projections from current large systems and technology evolution predict errors will happen in exascale systems many times per day. These errors will propagate and generate various kinds of malfunctions, from simple process crashes to result corruptions. The past five years have seen extraordinary technical progress in many domains related to exascale resilience. Several technical options, initially considered inapplicable or unrealistic in the HPC context, have demonstrated surprising successes. Despite this progress, the exascale resilience problem is not solved, and the community is still facing the difficult challenge of ensuring that exascale applications complete and generate correct results while running on unstable systems. Since 2009, many workshops, studies, and reports have improved the definition of the resilience problem and provided refined recommendations. Some projections made during the previous decades and some priorities established from these projections need to be revised. This paper surveys what the community has learned in the past five years and summarizes the research problems still considered critical by the HPC community.

“…This work implements the checkpoint/restart procedure by transferring the process image to a healthy spare node for the purpose of resuming the process. Wang et al [12] proposed a process-level live migration mechanism to support continued execution of MPI processes. This work is integrated into an MPI execution environment to transparently sustain health-inflated node failures, which eradicates the need to restart and requeue MPI jobs.…”

Section: Related Work (mentioning)

confidence: 99%

Transparent Accelerator Migration in a Virtualized GPU Environment

Xiao

Balaji

Dinan

et al. 2012

2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Ccgrid 2012)

Abstract: This paper presents a framework to support transparent, live migration of virtual GPU accelerators in a virtualized execution environment. Migration is a critical capability in such environments because it provides support for fault tolerance, on-demand system maintenance, resource management, and load balancing in the mapping of virtual to physical GPUs. Techniques to increase responsiveness and reduce migration overhead are explored. The system is evaluated by using four application kernels and is demonstrated to provide low migration overheads. Through transparent load balancing, our system provides a speedup of 1.7 to 1.9 for three of the four application kernels.

“…Recent studies [21], [22], [23], [24], [25] apply execution migration techniques to enhance resource management by avoiding possible failures. They demonstrate the feasibility of exploiting proactive management methods for dependability assurance in networked computer systems.…”

Section: Related Work (mentioning)

confidence: 99%

A Failure Detection and Prediction Mechanism for Enhancing Dependability of Data Centers

Guan

Zhang

Fu

2012

IJCTE

Abstract: Modern data centers continue to grow in their scale and complexity. They are changing dynamically as well due to the addition and removal of system components, changing execution environments, frequent updates and upgrades, online repairs and more. Classical reliability theory and conventional methods do rarely consider the actual state of a system and are therefore not capable to reflect the dynamics of runtime systems and failure processes. In this paper, we present an unsupervised failure detection and prediction method using an ensemble of Bayesian models. It characterizes normal execution states of the system and detects anomalous behaviors. We implement a prototype of our failure detection and prediction mechanism and evaluate its performance on a data center test platform. Experimental results show that our proposed method can forecast failure dynamics with high accuracy.
