Motor: A Virtual Machine for High Performance Computing

Goscinski, Wojtek; Abramson, David

doi:10.1109/hpdc.2006.1652148

Cited by 8 publications

(13 citation statements)

References 13 publications


“…Recently, interest in using virtual machines (VMs) as the abstraction for distributed and parallel computing in general has been growing [20,24,13]. Virtual machine monitors, such as Xen [5] and VMware [46], have the potential to greatly simplify management from the perspective of resource owners and to provide great flexibility to resource users.…”

Section: Related Work (mentioning)

confidence: 99%


Failure-aware resource management for high-availability computing clusters with distributed virtual machines

2010

Journal of Parallel and Distributed Computing



“…They provide desirable features to meet demanding requirements of computing resources in networked computing systems [20,24,13]. A most important feature provided by modern VM techniques is their ability of reconfiguration through VM migration [8,41].…”

Section: Introduction (mentioning)

confidence: 99%

Failure-aware resource management for high-availability computing clusters with distributed virtual machines

2010

Journal of Parallel and Distributed Computing


“…Most of existing works on virtual machine based HPC focus on infrastructure design [7,11,3] and reducing virtualization overhead [20,15,4]. However, few has investigated the reliability issue, such as avoiding, coping and recovering from failures, which is one of the hardest problems in HPC systems.…”

Section: Introduction (mentioning)

confidence: 99%

“…They provide desirable features to meet demanding requirements of computing resources in modern computing systems. Performance isolation, ease of management, checkpointing, migration, OS customization, and security are among the many features that make VM a powerful and popular technique for high performance computing (HPC) [7,10,11].…”

Section: Introduction (mentioning)

confidence: 99%

Proactive Resource Management for Failure Resilient High Performance Computing Clusters

2009

2009 International Conference on Availability, Reliability and Security


Virtual machine (VM) technology provides an additional layer of abstraction for resource management in high-performance computing (HPC) systems. In large-scale computing clusters, component failures become norms instead of exceptions, caused by the ever-increasing system complexity. VM construction and reconfiguration is a potent tool for efficient online system maintenance and failure resilience. In this paper, we study how VM-based HPC clusters benefits from failure prediction in resource management for dependable computing. We consider both the reliability and performance status of compute nodes in making selection decisions. We define a capacity-reliability metric to combine the effects of both factors, and propose the Best-fit algorithm to find the best qualified nodes on which to instantiate VMs to run user jobs. We have conducted experiments using failure traces from the Los Alamos National Laboratory (LANL) HPC clusters. The results show the enhancement of system dependability by using our proposed strategy with practically achievable accuracy of failure prediction. With the Best-fit strategies, the job completion rate is increased by 10.5% compared with that achieved in the current LANL HPC cluster. The task completion rate reaches 82.5% with improved utilization of relatively unreliable nodes.


“…They provide desirable features to meet demanding requirements of computing resources in cluster computing [10,13,5]. A most important feature provided by modern VM techniques is their ability of reconfiguration through VM migration [3].…”

Section: Introduction (mentioning)

confidence: 99%

Failure-Aware Construction and Reconfiguration of Distributed Virtual Machines for High Availability Computing

2009

2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid


In large-scale clusters and computational grids, component failures become norms instead of exceptions. Failure occurrence as well as its impact on system performance and operation costs have become an increasingly important concern to system designers and administrators. In this paper, we study how to efficiently utilize system resources for high-availability clusters with the support of the virtual machine (VM) technology. We design a reconfigurable distributed virtual machine (RDVM) infrastructure for clusters computing. We propose failure-aware node selection strategies for the construction and reconfiguration of RDVMs. We leverage the proactive failure management techniques in calculating nodes' reliability status. We consider both the performance and reliability status of compute nodes in making selection decisions. We define a capacity-reliability metric to combine the effects of both factors in node selection, and propose Best-fit algorithms to find the best qualified nodes on which to instantiate VMs to run parallel jobs. We have conducted experiments using failure traces from production clusters and the NAS Parallel Benchmark programs on a real cluster. The results show the enhancement of system productivity and dependability by using the proposed strategies. With the Best-fit strategies, the job completion rate is increased by 17.6% compared with that achieved in the current LANL HPC cluster, and the task completion rate reaches 91.7%.

