Abstract: Fault tolerance is a major concern in guaranteeing the availability of critical services as well as application execution. Traditional approaches to fault tolerance include checkpoint/restart and duplication. However, it is also possible to anticipate failures and proactively take action before they occur, in order to minimise their impact on the system and on application execution. This document presents a proactive fault tolerance framework. This framework can use different proactive fault tolerance mechanisms, i.e…
“…Based on when a response is initiated with respect to the occurrence of the failure, approaches can be classified as proactive or reactive. Proactive approaches predict failures of computing resources before they occur and then relocate a job executing on resources anticipated to fail onto resources that are not predicted to fail (for example [32,43,44]). The control of a fault tolerant approach can be either centralised or distributed. In approaches where the control is centralised, one or more servers are used for backup and a single process is responsible for monitoring jobs that are executed on a network of nodes.…”
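The centralised control pattern described above can be sketched as a single monitor process that tracks heartbeats from worker nodes and relocates jobs away from nodes that appear to be failing. This is an illustrative sketch, not code from any of the cited systems; the class and method names, the timeout value, and the naive "first healthy node" relocation policy are all assumptions.

```python
import time

TIMEOUT = 3.0  # seconds without a heartbeat before a node is considered suspect

class CentralMonitor:
    """A single monitoring process for jobs running on a network of nodes."""

    def __init__(self, nodes):
        now = time.monotonic()
        self.last_beat = {n: now for n in nodes}  # node -> last heartbeat time
        self.jobs = {}                            # job id -> node it runs on

    def heartbeat(self, node):
        # Worker nodes call this periodically to report they are alive.
        self.last_beat[node] = time.monotonic()

    def assign(self, job, node):
        self.jobs[job] = node

    def sweep(self):
        """Relocate jobs from silent (suspect) nodes onto healthy ones."""
        now = time.monotonic()
        healthy = [n for n, t in self.last_beat.items() if now - t < TIMEOUT]
        for job, node in list(self.jobs.items()):
            if node not in healthy and healthy:
                self.jobs[job] = healthy[0]  # naive relocation policy
        return self.jobs
```

In a real deployment the monitor would be a network service and the relocation policy would consider load and failure predictions, but the division of labour is the same: one process observes, decides, and relocates.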
Section: Discussion
“…It is not desirable to have to restart a job from the beginning if it has been executing for hours, days, or months [6]. A key challenge in maintaining the seamless (or near-seamless) execution of such jobs in the event of failures is addressed under research in fault tolerance [7,8,9,10]. Many jobs rely on fault tolerant approaches that are implemented in the middleware supporting the job (for example [6,11,12,13]). The conventional fault tolerant mechanism supported by the middleware is checkpointing [14,15,16,17], which involves the periodic recording of intermediate states of execution of a job, to which execution can be returned if a fault occurs.…”
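The checkpointing mechanism described in the quote can be illustrated with a minimal sketch: a job periodically writes its intermediate state to disk, and on restart it resumes from the last recorded state rather than from the beginning. The file name, checkpoint interval, and the stand-in workload are hypothetical, not taken from the cited middleware.

```python
import json
import os

CHECKPOINT = "job.ckpt"  # hypothetical checkpoint file name

def run_job(total_steps):
    """Run a job step by step, periodically recording intermediate state."""
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "partial_sum": 0}

    while state["step"] < total_steps:
        state["partial_sum"] += state["step"]  # stand-in for real work
        state["step"] += 1
        if state["step"] % 10 == 0:            # checkpoint interval
            with open(CHECKPOINT, "w") as f:
                json.dump(state, f)
    return state["partial_sum"]
```

If the process is killed mid-run, a subsequent invocation of `run_job` loses at most the work done since the last checkpoint, which is the trade-off (checkpoint I/O cost versus recomputation on failure) that the reactive approaches in this page's papers measure.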
Automating fault tolerance in high-performance computational biological jobs using multi-agent approaches. Varghese, B., McKee, G., & Alexandrov, V. (2014). Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more computing cores on which they execute fail. This places not only a cost on the maintenance of the job, but also a cost on the time taken to reinstate the job, and the risk of losing data and execution accomplished by the job before it failed. Approaches which can proactively detect computing core failures and take action to relocate the computing core's job onto reliable cores can make a significant step towards automating fault tolerance. Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single-core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters.
A third approach is proposed that incorporates multi-agent technology at both the job and core level. Experiments are pursued in the context of genome searching, a popular computational biology application. Result: The key conclusion is that the approaches proposed are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which fault tolerance is studied, centralised and decentralised checkpointing approaches on average add 90% to the actual time for executing the job. In the same experiment, the multi-agent approaches add only 10% to the overall execution time. Keywords: high-performance computing | fault tolerance | biological jobs | multi-agents | seamless execution | checkpoint. Introduction: The scale of resources and computations required for executing large-scale biological jobs is significantly increasing [1,2]. With this increase, the resultant number of failures while running these jobs will also increase, and the time between failures will decrease [3,4,5]. It is not desirable to have to restart a job from the beginning if it has been executin...
“…These two facets are integrated in approaches that combine prediction and migration in proactive FT systems and evaluate different FT policies. In [50], the authors provide a generic framework based on a modular architecture that allows the implementation of new proactive fault tolerance policies and mechanisms. An agent-oriented framework [23] was developed for grid computing environments, with separate agents to monitor individual classes or subclasses of faults and proactively act to avoid or tolerate a fault.…”
As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale, due to massive I/O requirements, and relies on manual job resubmission. This work complements reactive FT with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the migration of processes. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
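The core go/no-go decision in this abstract is whether the predicted warning window is long enough for a given migration mechanism to complete. A hedged sketch of that decision is below; the function and default costs are illustrative (the 6.5 s and 24 s thresholds echo the figures quoted in the abstract), not an implementation of the paper's system.

```python
def choose_action(predicted_seconds_to_failure,
                  live_migration_cost=6.5,
                  os_virtualization_cost=24.0):
    """Pick the cheapest fault tolerance mechanism that fits the warning window.

    Process-level live migration needs roughly 1-6.5 s of warning; OS
    virtualization mechanisms need 13-24 s (per the abstract's figures).
    If neither fits, only a reactive checkpoint/restart remains.
    """
    if predicted_seconds_to_failure >= os_virtualization_cost:
        return "either"            # both mechanisms have time to finish
    if predicted_seconds_to_failure >= live_migration_cost:
        return "live-migration"    # only process-level migration fits
    return "checkpoint-restart"    # too late to migrate; fall back reactively
```

This framing also explains the abstract's last claim: every failure handled proactively is one that no longer has to be covered by a checkpoint, so handling 70% of faults proactively lets the checkpoint interval grow and nearly halves the number of checkpoints taken.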
“…Application reallocation was performed using a load balancer. Another Type 1 prototype [12] investigated coordination, protocols, and interfaces between individual system components.…”
“…Other initial work focused on a proactive FT framework [12], which combines both to perform prediction-triggered migration. However, evaluation and comparison of individual solutions is very difficult at this early research stage, due to missing realistic architectural models for the deployment of proactive FT technology in extreme-scale HPC systems.…”
Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. This paper further relates prior work to the presented architecture and classification, and discusses the challenges ahead for needed supporting technologies.