Implementation and performance evaluation of an adaptable failure detector

Bertier, Marin; Marin, Olivier; Sens, Pierre

doi:10.1109/dsn.2002.1028920

Cited by 153 publications

(142 citation statements)

References 12 publications

(11 reference statements)

Supporting

Mentioning

129

Contrasting

Unclassified


“…An adaptive failure detector should be designed to improve the quality of failure detection service to fit the application needs and network environmental changes. Bertier [7] proposed the implementation of failure detectors based on failure detection as a novel shared service between several applications. Failure detection based on the sharing of other nodes' failure status can facilitate detection time at the cost of increased overhead control.…”

Section: Background and Related Work (mentioning)

confidence: 99%

Failure Detection in P2P-Grid System

Wang

Nakazato

2015

IEICE Trans. Inf. & Syst.


SUMMARY: Peer-to-peer (P2P)-Grid systems are being investigated as a platform for converging the Grid and P2P network in the construction of large-scale distributed applications. The highly dynamic nature of P2P-Grid systems greatly affects the execution of the distributed program. Uncertainty caused by arbitrary node failure and departure significantly affects the availability of computing resources and system performance. Checkpoint-and-restart is the most common scheme for fault tolerance because it periodically saves the execution progress onto stable storage. In this paper, we suggest a checkpoint-and-restart mechanism as a fault-tolerant method for applications on P2P-Grid systems. Failure detection mechanism is a necessary prerequisite to fault tolerance and fault recovery in general. Given the highly dynamic nature of nodes within P2P-Grid systems, any failure should be detected to ensure effective task execution. Therefore, failure detection mechanism as an integral part of P2P-Grid systems was studied. We discussed how the design of various failure detection algorithms affects their performance in average failure detection time of nodes. Numerical analysis results and implementation evaluation are also provided to show different average failure detection times in real systems for various failure detection algorithms. The comparison shows the shortest average failure detection time by 8.8s on basis of the WP failure detector. Our lowest mean time to recovery (MTTR) is also proven to have a distinct advantage with a time consumption reduction of about 5.5s over its counterparts.


“…Often these pseudo-codes use syntactical constructs such as repeat periodically (Chandra & Toueg, 1996; Aguilera, Chen, & Toueg, 1999; Bertier, Marin, & Sens, 2002), at time t send heartbeat (Chen, Toueg, & Aguilera, 2002; Bertier et al., 2002), at time t check whether message has arrived, or upon receive, together with several variants (see Table 1). Such syntactical constructs are not often found in COTS programming languages such as C or C++, which leads us to the problem of translating the protocol specifications into running software prototypes using one such standard language.…”

Section: Failure Detection Protocols In the Application Layer (mentioning)

confidence: 99%

Application-Layer Fault-Tolerance Protocols

Florio

2009

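The constructs named in the statement above (at time t check whether message has arrived, upon receive) can be mapped onto ordinary event handlers. The following is a minimal sketch in Python, not the protocol from any of the cited papers: the names HeartbeatMonitor and run_simulation, the fixed period and timeout values, and the replayed event list are all assumptions made for illustration. A fixed timeout is used here for simplicity; it is not the adaptive scheme evaluated by Bertier, Marin, and Sens.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Fixed-timeout monitor for one remote process (hypothetical name)."""
    period: float               # assumed interval between heartbeats, in seconds
    timeout: float              # assumed extra wait before suspecting the sender
    last_arrival: float = 0.0   # time of the most recent heartbeat
    suspected: bool = False

    def on_receive(self, now: float) -> None:
        # "upon receive": record the arrival and clear any suspicion.
        self.last_arrival = now
        self.suspected = False

    def check(self, now: float) -> bool:
        # "at time t check whether message has arrived": suspect the sender
        # if no heartbeat arrived within period + timeout.
        if now - self.last_arrival > self.period + self.timeout:
            self.suspected = True
        return self.suspected


def run_simulation(heartbeat_times, check_times, period=1.0, timeout=0.5):
    """Replay arrivals and checks in time order; return (time, suspected) per check."""
    monitor = HeartbeatMonitor(period=period, timeout=timeout)
    # The middle tuple element makes arrivals sort before checks at equal times.
    events = [(t, 0, "recv") for t in heartbeat_times]
    events += [(t, 1, "check") for t in check_times]
    results = []
    for t, _, kind in sorted(events):
        if kind == "recv":
            monitor.on_receive(t)
        else:
            results.append((t, monitor.check(t)))
    return results


if __name__ == "__main__":
    # Heartbeats arrive at 1, 2 and 3 s, then the sender stops.
    for t, suspected in run_simulation([1.0, 2.0, 3.0], [2.5, 4.0, 5.0]):
        print(t, "suspected" if suspected else "trusted")
```

Replaying timestamps instead of reading a real clock keeps the sketch deterministic; in a running prototype the same two handlers would be driven by a timer and a socket receive loop.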

“…Failure detection in MPI relies usually on heart beat technique [2] or on sender-based logging [16] that consist in detecting remote activity through the network. Such techniques detect node or link failures, not data corruption.…”

Section: Related Work (mentioning)

confidence: 99%

High Performance Checksum Computation for Fault-Tolerant MPI over Infiniband

Denis

Ishikawa

2012

Recent Advances in the Message Passing Interface


Abstract. With the increase of the number of nodes in clusters, the probability of failures and unusual events increases. In this paper, we present checksum mechanisms to detect data corruption. We study the impact of checksums on network communication performance and we propose a mechanism to amortize their cost on InfiniBand. We have implemented our mechanisms in the NEWMADELEINE communication library. Our evaluation shows that our mechanisms to ensure message integrity do not impact noticeably the application performance, which is an improvement over the state of the art MPI implementations.

