A low-overhead recovery technique using quasi-synchronous checkpointing

Manivannan, D.; Singhal, Mukesh

doi:10.1109/icdcs.1996.507906

Cited by 85 publications

(70 citation statements)

References 16 publications

“…In practice, in order to achieve high availability, self-repairing and self-healing mechanisms are widely adopted in fault-tolerant systems to achieve automatic recovery after the crash occurs. Particularly in middleware systems, there are many techniques and algorithms are proposed to achieve the self-repairing or self-healing goal, such as the connector-based self-healing system described in [32,77] or the reflection technique adopted in [12] or the snapshot algorithms in [61,65]. As we can see that the crash-recovery failure is quite common in many fault-tolerant systems.…”

Section: Motivation (mentioning)

confidence: 99%

On the Quality of Service of Crash-Recovery Failure Detectors

Hillston

Anderson

2007

37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)

This thesis presents the results of an investigation into the failure detection problem. We consider the specific case of the Quality of Service (QoS) of crash failure detection. In contrast to previous work, we address the crash failure detection problem when the monitored target is resilient and recovers after failure. To the best of our knowledge, this is the first work to provide an analysis of crash-recovery failure detection from the QoS perspective. We develop a probabilistic model of the behavior of a crash-recovery target, i.e. one which has the ability to recover from the crash state. We show that the fail-free run and the crash-stop run are special cases of the crash-recovery run with mean time to failure (MTTF) approaching infinity and mean time to recovery (MTTR) approaching infinity, respectively. We extend the previously published QoS metrics to allow the measurement of the recovery speed, and the definition of the completeness property of a failure detector. Then, the impact of the dependability of the crash-recovery target on the QoS bounds for such a crash-recovery failure detector is analyzed using general dependability metrics, such as MTTF and MTTR, based on an approximate probabilistic model of the two-process failure detection system. Then, according to our approximate model, we show analytically how to estimate the failure detector's parameters to achieve a required QoS, based on Chen et al.'s NFD-S algorithm, and how to execute the configuration procedure of this crash-recovery failure detector. In order to make the failure detector adaptive to the target's crash-recovery behavior and enable the autonomy of the monitoring procedure, we propose two types of recovery detection protocols. One is a reliable recovery detection protocol, which can guarantee to detect each occurring failure and recovery by adopting persistent storage. The other is a lightweight recovery detection protocol, which does not guarantee to detect every failure and recovery but which reduces the system overhead. Both of these recovery detection protocols improve the completeness without reducing the other QoS aspects of a failure detector. In addition, we also demonstrate how to estimate the inputs, such as the dependability metrics, using the failure detector itself. In order to evaluate our analytical work, we simulate the following failure detection al- This conforms well to our models and analysis. We show that in the case of reasonably long MTTF, the NFD-S algorithm with the lightweight recovery detection protocol exhibits better QoS than the NFD-S algorithm for the completeness of a crash-recovery failure detector, and similarly for other QoS metrics.

“…However, it will increase the recovery time as greater rollback will be required. Although some algorithms were proposed to reduce the number of checkpoints to be saved on stable storage, to ensure correctness, a process still needs to keep many more checkpoints in uncoordinated checkpointing algorithms [55], [58], [59], [97]. Generally speaking, uncoordinated checkpointing approaches suffer from the complexities of finding a consistent recovery line after the failure, domino-effect, high stable storage overhead of saving multiple checkpoints of each process, and the overhead of garbage collection.…”

Section: Related Work (mentioning)

confidence: 99%

A Review of Fault Tolerant Checkpointing Protocols for Mobile Computing Systems

Garg

Kumar

2010

IJCA

A distributed system is a collection of independent entities that cooperate to solve a problem that cannot be individually solved. A mobile computing system is a distributed system where some of the processes are running on mobile hosts (MHs), whose location in the network changes with time. Mobile distributed systems raise new issues such as mobility, low bandwidth of wireless channels, disconnections, limited battery power and lack of reliable stable storage on mobile nodes. This paper addresses the problem of fault tolerant computing in mobile distributed systems. The techniques described are based on checkpointing and rollback recovery.

“…Most nonblocking algorithms [13], [24], [30] use a Checkpoint Sequence Number (sn) to avoid inconsistencies. More specifically, a process is forced to take a checkpoint if it receives a computation message whose sn is greater than its local sn.…”

Section: The Basic Idea Behind Nonblocking Algorithms (mentioning)

confidence: 99%
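
The sequence-number rule quoted in the statement above can be illustrated with a minimal Python sketch. This is not the code of any cited algorithm: the `Process` class, its method names, and the choice to adopt the received sn when a forced checkpoint is taken are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process model; all names here are illustrative assumptions."""
    pid: int
    sn: int = 0                                       # local checkpoint sequence number
    state: list = field(default_factory=list)         # payloads processed so far
    checkpoints: list = field(default_factory=list)   # (sn, forced, saved state)

    def take_checkpoint(self, sn: int, forced: bool) -> None:
        # Save the current state under the given sequence number.
        self.sn = sn
        self.checkpoints.append((sn, forced, tuple(self.state)))

    def basic_checkpoint(self) -> None:
        # A checkpoint the process takes on its own initiative.
        self.take_checkpoint(self.sn + 1, forced=False)

    def send(self, payload):
        # Piggyback the local sn on every computation message.
        return (self.sn, payload)

    def receive(self, message) -> None:
        msg_sn, payload = message
        # Rule from the quoted text: a message carrying a larger sn forces
        # a checkpoint BEFORE the message is processed.
        # Assumption: the receiver then adopts the sender's sn.
        if msg_sn > self.sn:
            self.take_checkpoint(msg_sn, forced=True)
        self.state.append(payload)


p, q = Process(pid=0), Process(pid=1)
p.basic_checkpoint()        # p moves to sn 1
q.receive(p.send("x"))      # q is at sn 0, so it takes a forced checkpoint
q.receive(p.send("y"))      # q is already at sn 1, so no checkpoint is taken
```

In this sketch the forced checkpoint records q's state before "x" is applied, so the checkpoint with sn 1 at q does not reflect a message sent after p's checkpoint with sn 1.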

“…More information on how to deal with process failures can be found in [20], [24], [28]. Since failure detection and failure recovery are orthogonal to our discussion, we will not discuss it further.…”

Section: Handling Failures During Checkpointing (mentioning)

confidence: 99%

“…However, this will increase the recovery time as greater rollback and reply will be needed. Even though some algorithms [24], [35] were proposed to reduce the number of checkpoints to be saved on the stable storage, to ensure correctness, a process still needs to keep many more checkpoints in uncoordinated checkpointing algorithms than those in coordinated checkpointing algorithms. In the coordinated checkpointing algorithm presented in this paper, most of the time, each process needs to store only one permanent checkpoint on the stable storage and at most two checkpoints: a permanent and a tentative (or mutable) checkpoint only for the duration of the checkpointing.…”

Section: Related Work (mentioning)

confidence: 99%

Mutable checkpoints: a new checkpointing approach for mobile computing systems

Singhal

2001

IEEE Trans. Parallel Distrib. Syst.

Self Cite

Abstract: Mobile computing raises many new issues such as lack of stable storage, low bandwidth of wireless channel, high mobility, and limited battery life. These new issues make traditional checkpointing algorithms unsuitable. Coordinated checkpointing is an attractive approach for transparently adding fault tolerance to distributed applications since it avoids domino effects and minimizes the stable storage requirement. However, it suffers from high overhead associated with the checkpointing process in mobile computing systems. Two approaches have been used to reduce the overhead: First is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process nonblocking. These two approaches were orthogonal previously until the Prakash-Singhal algorithm [28] combined them. However, we [8] found that this algorithm may result in an inconsistency in some situations and we proved that there does not exist a nonblocking algorithm which forces only a minimum number of processes to take their checkpoints. In this paper, we introduce the concept of “mutable checkpoint,” which is neither a tentative checkpoint nor a permanent checkpoint, to design efficient checkpointing algorithms for mobile computing systems. Mutable checkpoints can be saved anywhere, e.g., the main memory or local disk of MHs. In this way, taking a mutable checkpoint avoids the overhead of transferring large amounts of data to the stable storage at MSSs over the wireless network. We present techniques to minimize the number of mutable checkpoints. Simulation results show that the overhead of taking mutable checkpoints is negligible. Based on mutable checkpoints, our nonblocking algorithm avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on the stable storage.
