Distributed diagnosis in dynamic fault environments

Subbiah, Arun; Blough, Douglas M.

doi:10.1109/tpds.2004.1278102

Cited by 50 publications

(41 citation statements)

References 25 publications


“…Such approaches are susceptible to single point failure, lack scalability over a large network of nodes, have large overheads, and require large disk storage. These drawbacks can be minimised or avoided when the control of the approaches is distributed (for example, distributed diagnosis [50], distributed checkpointing [41] and diskless checkpointing [51]). …”

Section: Discussion (mentioning)

confidence: 99%

Automating fault tolerance in high-performance computational biological jobs using multi-agent approaches

Varghese

McKee

Alexandrov

2014

Computers in Biology and Medicine


Varghese, B., McKee, G., & Alexandrov, V. (2014). Automating fault tolerance in high-performance computational biological jobs using multi-agent approaches.

Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more computing cores on which they execute fail. This places not only a cost on the maintenance of the job, but also a cost on the time taken for reinstating the job and the risk of losing data and execution accomplished by the job before it failed. Approaches which can proactively detect computing core failures and take action to relocate the computing core's job onto reliable cores can make a significant step towards automating fault tolerance.

Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology both at the job and core level. Experiments are pursued in the context of genome searching, a popular computational biology application.

Result: The key conclusion is that the approaches proposed are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which the fault tolerance is studied, centralised and decentralised checkpointing approaches on an average add 90% to the actual time for executing the job. On the other hand, in the same experiment the multi-agent approaches add only 10% to the overall execution time.

Keywords: high-performance computing | fault tolerance | biological jobs | multi-agents | seamless execution | checkpoint

Introduction: The scale of resources and computations required for executing large-scale biological jobs are significantly increasing [1,2]. With this increase the resultant number of failures while running these jobs will also increase and the time between failures will decrease [3,4,5]. It is not desirable to have to restart a job from the beginning if it has been executin...



“…A lot of research in distributed diagnosis has considered computer systems connected with a network, for example in [12], a Hierarchical Adaptive Distributed System-level Diagnosis algorithm is presented which allows every fault-free node to achieve diagnosis in maximum (log₂ N)² testing rounds. In [13], a diagnostic algorithm HeartbeatComplete is presented, which offers bounded correctness in fully connected systems while simultaneously minimizing diagnostic latency, start-up time, and state holding time. In [14], a rule based diagnosis through a monitor system for diagnosis in large scale network protocols is discussed.…”

Section: Context (mentioning)

confidence: 99%

Wireless Network Architecture for Diagnosis and Monitoring Applications

Khan

Thiriet

Genon-Catalot

2009

2009 6th IEEE Consumer Communications and Networking Conference


Abstract: This paper describes a distributed wireless network architecture for remote diagnosis and monitoring. A wind energy conversion system (WECS) is considered as the target application, where windmills are grouped into small clusters communicating with each other for the purpose of distributed diagnosis. The evaluation of this network is simulated to support effective communication needs required for fault detection and as a result to send an alarm or caution message to the remote monitoring station. However, as the sensitivity of application increases, strict requirements on availability, robustness, reliability and performance of network resources must be satisfied in order to meet industrial standards.


“…The set of proofs presented in (DUARTE-JR; WEBER; FONSECA, 2012) is based on the theoretical framework known as Bounded Correctness (SUBBIAH; BLOUGH, 2004). In this paper we employ a set of proofs which does not rely on that framework, and may be more intuitive for readers which are not familiar with it.…”

Section: Introduction (mentioning)

confidence: 99%

“…Subbiah and Blough introduced in (SUBBIAH; BLOUGH, 2004) a formal model of the dynamic behavior of diagnosis algorithms, called Bounded Correctness, which allows a diagnosis algorithm to be rigorously proven to be correct under a dynamic fault situation. Bounded Correctness has three goals.…”

Section: Introduction (mentioning)

confidence: 99%

Alternative specification and correctness proofs of the distributed network reachability algorithm DOI 10.5752/P.2316-9451.2012v1n1p5

Weber

Duarte

Fonseca

2012

Abakos


Abstract: The Distributed Network Reachability (DNR) algorithm (DUARTE-JR; WEBER; FONSECA, 2012) allows every node in a general topology network to determine which portions of the network are reachable and unreachable. The algorithm consists of three phases: test, dissemination, and reachability computation. During the testing phase each link is tested by one of the adjacent nodes at alternating testing intervals. Upon the detection of a new event, the tester starts the dissemination phase. In this work we give both an alternative specification of DNR, which employs tokens at the testing phase so that the pair of nodes connected by a link can share testing responsibilities, and an alternative set of proofs for the algorithm.

