Abstract: Fault tolerance is a major concern in guaranteeing the availability of critical services as well as application execution. Traditional approaches to fault tolerance include checkpoint/restart and duplication. However, it is also possible to anticipate failures and proactively take action before they occur, in order to minimise their impact on the system and on application execution. This document presents a proactive fault tolerance framework. This framework can use different proactive fault tolerance mechanisms, i.e…
“…Based on when a response is initiated with respect to the occurrence of the failure, approaches can be classified as proactive or reactive. Proactive approaches predict failures of computing resources before they occur and then relocate a job executing on resources anticipated to fail onto resources that are not predicted to fail (for example [32,43,44]). The control of a fault tolerant approach can be either centralised or distributed. In approaches where the control is centralised, one or more servers are used for backup and a single process is responsible for monitoring jobs that are executed on a network of nodes.…”
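The centralised control pattern described above can be sketched as a single monitor process that tracks heartbeats from worker nodes and relocates jobs away from nodes that appear to be failing. This is an illustrative sketch, not code from any of the cited systems; the class and method names, the timeout value, and the naive "first healthy node" relocation policy are all assumptions.

```python
import time

TIMEOUT = 3.0  # seconds without a heartbeat before a node is considered suspect

class CentralMonitor:
    """A single monitoring process for jobs running on a network of nodes."""

    def __init__(self, nodes):
        now = time.monotonic()
        self.last_beat = {n: now for n in nodes}  # node -> last heartbeat time
        self.jobs = {}                            # job id -> node it runs on

    def heartbeat(self, node):
        # Worker nodes call this periodically to report they are alive.
        self.last_beat[node] = time.monotonic()

    def assign(self, job, node):
        self.jobs[job] = node

    def sweep(self):
        """Relocate jobs from silent (suspect) nodes onto healthy ones."""
        now = time.monotonic()
        healthy = [n for n, t in self.last_beat.items() if now - t < TIMEOUT]
        for job, node in list(self.jobs.items()):
            if node not in healthy and healthy:
                self.jobs[job] = healthy[0]  # naive relocation policy
        return self.jobs
```

In a real deployment the monitor would be a network service and the relocation policy would consider load and failure predictions, but the division of labour is the same: one process observes, decides, and relocates.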
Section: Discussion
“…It is not desirable to have to restart a job from the beginning if it has been executing for hours, days, or months [6]. A key challenge in maintaining the seamless (or near-seamless) execution of such jobs in the event of failures is addressed under research in fault tolerance [7,8,9,10]. Many jobs rely on fault tolerant approaches that are implemented in the middleware supporting the job (for example [6,11,12,13]). The conventional fault tolerant mechanism supported by the middleware is checkpointing [14,15,16,17], which involves the periodic recording of intermediate states of execution of a job, to which execution can be returned if a fault occurs.…”
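The checkpointing mechanism described in the quote can be illustrated with a minimal sketch: a job periodically writes its intermediate state to disk, and on restart it resumes from the last recorded state rather than from the beginning. The file name, checkpoint interval, and the stand-in workload are hypothetical, not taken from the cited middleware.

```python
import json
import os

CHECKPOINT = "job.ckpt"  # hypothetical checkpoint file name

def run_job(total_steps):
    """Run a job step by step, periodically recording intermediate state."""
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "partial_sum": 0}

    while state["step"] < total_steps:
        state["partial_sum"] += state["step"]  # stand-in for real work
        state["step"] += 1
        if state["step"] % 10 == 0:            # checkpoint interval
            with open(CHECKPOINT, "w") as f:
                json.dump(state, f)
    return state["partial_sum"]
```

If the process is killed mid-run, a subsequent invocation of `run_job` loses at most the work done since the last checkpoint, which is the trade-off (checkpoint I/O cost versus recomputation on failure) that the reactive approaches in this page's papers measure.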
Automating fault tolerance in high-performance computational biological jobs using multi-agent approaches. Varghese, B., McKee, G., & Alexandrov, V. (2014). Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more computing cores on which they execute fail. This places not only a cost on the maintenance of the job, but also a cost on the time taken to reinstate the job, and the risk of losing data and execution accomplished by the job before it failed. Approaches which can proactively detect computing core failures and take action to relocate the computing core's job onto reliable cores can make a significant step towards automating fault tolerance. Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single-core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters.
A third approach is proposed that incorporates multi-agent technology at both the job and core level. Experiments are pursued in the context of genome searching, a popular computational biology application. Result: The key conclusion is that the approaches proposed are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which fault tolerance is studied, centralised and decentralised checkpointing approaches on average add 90% to the actual time for executing the job. In the same experiment, the multi-agent approaches add only 10% to the overall execution time. Keywords: high-performance computing | fault tolerance | biological jobs | multi-agents | seamless execution | checkpoint. Introduction: The scale of resources and computations required for executing large-scale biological jobs is significantly increasing [1,2]. With this increase, the resultant number of failures while running these jobs will also increase, and the time between failures will decrease [3,4,5]. It is not desirable to have to restart a job from the beginning if it has been executin...
“…These two facets are integrated in approaches that combine prediction and migration in proactive FT systems and evaluate different FT policies. In [50], the authors provide a generic framework based on a modular architecture that allows the implementation of new proactive fault tolerance policies and mechanisms. An agent-oriented framework [23] was developed for grid computing environments, with separate agents to monitor individual classes or subclasses of faults and proactively act to avoid or tolerate a fault.…”
As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale, due to massive I/O requirements, and relies on manual job resubmission. This work complements reactive FT with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the migration of processes. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
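The core go/no-go decision in this abstract is whether the predicted warning window is long enough for a given migration mechanism to complete. A hedged sketch of that decision is below; the function and default costs are illustrative (the 6.5 s and 24 s thresholds echo the figures quoted in the abstract), not an implementation of the paper's system.

```python
def choose_action(predicted_seconds_to_failure,
                  live_migration_cost=6.5,
                  os_virtualization_cost=24.0):
    """Pick the cheapest fault tolerance mechanism that fits the warning window.

    Process-level live migration needs roughly 1-6.5 s of warning; OS
    virtualization mechanisms need 13-24 s (per the abstract's figures).
    If neither fits, only a reactive checkpoint/restart remains.
    """
    if predicted_seconds_to_failure >= os_virtualization_cost:
        return "either"            # both mechanisms have time to finish
    if predicted_seconds_to_failure >= live_migration_cost:
        return "live-migration"    # only process-level migration fits
    return "checkpoint-restart"    # too late to migrate; fall back reactively
```

This framing also explains the abstract's last claim: every failure handled proactively is one that no longer has to be covered by a checkpoint, so handling 70% of faults proactively lets the checkpoint interval grow and nearly halves the number of checkpoints taken.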
“…Application reallocation was performed using a load balancer. Another Type 1 prototype [12] investigated coordination, protocols, and interfaces between individual system components.…”
“…Other initial work focused on a proactive FT framework [12], which combines both to perform prediction-triggered migration. However, evaluation and comparison of individual solutions is very difficult at this early research stage, due to missing realistic architectural models for the deployment of proactive FT technology in extreme-scale HPC systems.…”
Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. This paper further relates prior work to the presented architecture and classification, and discusses the challenges ahead for needed supporting technologies.