A survey of rollback-recovery protocols in message-passing systems

Elnozahy, E. N. Mootaz; Alvisi, Lorenzo; Wang, Yimin; Johnson, David B.

doi:10.1145/568522.568525

Cited by 1,392 publications

(1,079 citation statements)

References 39 publications

Supporting

Mentioning

994

Contrasting

Unclassified


“…Optimistic message logging is very attractive for providing fault-tolerance with low failure-free overhead for large-scale distributed systems [3]. However, it may suffer from cascading rollback due to its message log volatility.…”

Section: Introduction (mentioning)

confidence: 99%

On Reducing Rollback Propagation Effect of Optimistic Message Logging for Group-Based Distributed Systems

Ahn

2013

IEICE Trans. Inf. & Syst.


SUMMARY: This paper presents a new scalable method to considerably reduce the rollback propagation effect of the conventional optimistic message logging by utilizing positive features of reliable FIFO group communication links. To satisfy this goal, the proposed method forces group members to replicate different receive sequence numbers (RSNs), which they assigned for each identical message to their group respectively, into their volatile memories. As the degree of redundancy of RSNs increases, the possibility of local recovery for each crashed process may significantly be higher. Experimental results show that our method can outperform the previous one in terms of the rollback distance of non-faulty processes with a little normal time overhead.


“…During recovery, log-based rollback-recovery protocols force the execution of the system to be identical to the one that occurred before the failure, up to the maximum recoverable state. Therefore, the system always recovers to a state that is consistent with the input and output interactions that occurred up to the maximum recoverable state [45]. …”

Section: Log-based Rollback Recovery (mentioning)

confidence: 75%
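
The replay idea in the statement above can be illustrated with a minimal sketch. It assumes a process whose only nondeterministic events are message deliveries, and every name in it (`apply`, `run`, `recover`) is hypothetical rather than taken from the survey or the citing paper. Recovery restores a checkpointed state and re-delivers the logged messages in their original order, which reproduces the pre-failure state.

```python
def apply(state: int, message: int) -> int:
    """Hypothetical deterministic handler: the next state depends only on (state, message)."""
    return state * 31 + message


def run(initial: int, messages: list[int], log: list[int]) -> int:
    """Failure-free execution: log each delivery, then apply it."""
    state = initial
    for m in messages:
        log.append(m)  # record the delivery (assumed to be the only nondeterministic event)
        state = apply(state, m)
    return state


def recover(checkpoint: int, log: list[int]) -> int:
    """Recovery: start from the checkpointed state and replay the log in order."""
    state = checkpoint
    for m in log:
        state = apply(state, m)
    return state


log: list[int] = []
before_failure = run(7, [3, 1, 4], log)
after_recovery = recover(7, log)
print(before_failure == after_recovery)
```

The sketch keeps the log in memory for brevity; a real protocol would have to decide when log entries reach stable storage, which is where the pessimistic, optimistic, and causal variants differ.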

“…Lost messages may occur when in-transit messages between two processes are not captured by a checkpointing mechanism. Therefore when these two checkpoint files are restored for the application to continue, p2 will never receive the message m1 (unless retransmitted) and this can lead to a failure [45].…”

Section: Terminology (mentioning)

confidence: 99%
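
As a rough illustration of the lost-message condition described above, the sketch below compares what one checkpoint records as sent with what the other records as received. The `Checkpoint` structure and the per-message integer ids are assumptions made for this example; they do not come from the citing paper.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical per-process checkpoint holding the ids of messages sent and received so far."""
    process: str
    sent: dict[str, set[int]] = field(default_factory=dict)      # destination -> message ids
    received: dict[str, set[int]] = field(default_factory=dict)  # source -> message ids


def lost_messages(sender: Checkpoint, receiver: Checkpoint) -> set[int]:
    """Ids the sender's checkpoint records as sent but the receiver's does not record as received."""
    sent = sender.sent.get(receiver.process, set())
    got = receiver.received.get(sender.process, set())
    return sent - got


# p1 checkpointed after sending m1 (id 1); p2 checkpointed before receiving it.
p1 = Checkpoint("p1", sent={"p2": {1}})
p2 = Checkpoint("p2")
print(lost_messages(p1, p2))
```

Restoring this pair of checkpoints leaves message 1 in neither process's future, so it has to be retransmitted or recovered from a log.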

“…This approach is suitable for applications that interact with the outside world which consists of all input and output devices that cannot roll back [45].…”

Section: Maximum Recovery State (mentioning)

confidence: 99%

“…For example, a process may decide to take checkpoints where the amount of state information to be saved is small, thus reducing overhead. This means that failure free performance overhead is low compared to other checkpoint based recovery techniques [45].…”

Section: Uncoordinated Checkpointing (mentioning)

confidence: 99%


Checkpointing of Parallel Applications in a Grid Environment

Sajadah

Terstyansky

Winter

et al.

Distributed and Parallel Systems


This is an electronic version of an MPhil thesis awarded by the University of Westminster.

Abstract: The Grid environment is generic, heterogeneous, and dynamic, with many unreliable resources that leave it highly exposed to failures. The environment is unreliable because it is geographically dispersed, involves multiple autonomous administrative domains, and is composed of a large number of components. Examples of failures in the Grid environment include application crashes, Grid node crashes, network failures, and Grid system component failures. These types of failures can affect the execution of parallel/distributed applications in the Grid environment, so protection against these faults is crucial. Therefore, it is essential to develop efficient fault-tolerant mechanisms to allow users to successfully execute Grid applications. One of the research challenges in Grid computing is to develop a fault-tolerant solution that ensures Grid applications are executed reliably with minimum overhead incurred. While checkpointing is the most common method to achieve fault tolerance, there is still a lot of work to be done to improve the efficiency of the mechanism. This thesis provides an in-depth description of a novel solution for checkpointing parallel applications executed on a Grid. The checkpointing mechanism implemented allows an application to be checkpointed at regions where no interprocess communication is involved, thereby reducing the checkpointing overhead and checkpoint size.
