Total order broadcast and multicast (also called atomic broadcast/multicast) are an important problem in distributed systems, especially with respect to fault tolerance. In short, the primitive ensures that messages sent to a set of processes are delivered by all of those processes in the same total order. The problem has inspired an abundance of literature and a large number of proposed algorithms. This article proposes a classification of total order broadcast and multicast algorithms based on their ordering mechanisms, and addresses a number of other important issues. The article surveys about sixty algorithms, making it by far the most extensive study of the problem to date. It covers algorithms for both the synchronous and the asynchronous system models, and studies the respective properties and behavior of the different algorithms. (X. Défago et al.)
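To make the delivery guarantee concrete, here is a minimal single-machine sketch of one simple ordering mechanism, a fixed sequencer that stamps every message with a global sequence number. It is an illustration only: the names `Sequencer`, `Process`, and `run_demo` are hypothetical, the "network" is simulated by shuffling, and the sketch assumes no failures, so it shows the ordering property but none of the fault tolerance a real algorithm needs.

```python
import heapq
import random


class Sequencer:
    """Hypothetical fixed sequencer: assigns consecutive sequence numbers."""

    def __init__(self):
        self.next_seq = 0

    def order(self, msg):
        seq = self.next_seq
        self.next_seq += 1
        return seq, msg


class Process:
    """Destination that delivers messages strictly in sequence-number order."""

    def __init__(self, name):
        self.name = name
        self.pending = []        # min-heap of (seq, msg) received early
        self.next_expected = 0   # next sequence number allowed to be delivered
        self.delivered = []

    def receive(self, seq, msg):
        heapq.heappush(self.pending, (seq, msg))
        # Deliver every buffered message that is now contiguous with the
        # last delivered one; anything with a gap before it stays buffered.
        while self.pending and self.pending[0][0] == self.next_expected:
            _, ready = heapq.heappop(self.pending)
            self.delivered.append(ready)
            self.next_expected += 1


def run_demo(seed=0):
    rng = random.Random(seed)
    sequencer = Sequencer()
    processes = [Process(f"p{i}") for i in range(3)]
    stamped = [sequencer.order(m) for m in ["a", "b", "c", "d"]]
    for p in processes:
        arrivals = stamped[:]
        rng.shuffle(arrivals)    # simulate per-destination network reordering
        for seq, msg in arrivals:
            p.receive(seq, msg)
    return [p.delivered for p in processes]
```

Even though each simulated process receives the stamped messages in a different order, buffering until the next expected sequence number arrives makes every process deliver the same sequence. The single sequencer is an obvious bottleneck and single point of failure in this sketch.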
The detection of failures is a fundamental issue for fault tolerance in distributed systems. Recently, many people have come to realize that failure detection ought to be provided as some form of generic service, similar to IP address lookup or time synchronization. However, this has not been successful so far, one reason being that classical failure detectors were not designed to satisfy several application requirements simultaneously. We present a novel abstraction, called accrual failure detectors, that emphasizes flexibility and expressiveness and can serve as a basic building block for implementing failure detectors in distributed systems. Instead of providing information of a binary nature (trust vs. suspect), accrual failure detectors output a suspicion level on a continuous scale. The principal merit of this approach is that it favors a nearly complete decoupling between application requirements and the monitoring of the environment. In this paper, we describe an implementation of such an accrual failure detector, which we call the ϕ failure detector. The particularity of the ϕ failure detector is that it dynamically adjusts the scale on which the suspicion level is expressed to current network conditions. We analyzed the behavior of our ϕ failure detector over an intercontinental communication link for a week. Our experimental results show that ϕ performs as well as other known adaptive failure detection mechanisms, with improved flexibility.
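The idea of a continuous suspicion level can be sketched in a few lines. The example below assumes that heartbeat inter-arrival times are modeled with a normal distribution estimated over a sliding window, and that the suspicion level is the negative base-10 logarithm of the probability that a heartbeat would still arrive later than the time already elapsed. The class name `PhiAccrualDetector`, the window size, and the `min_std` floor are illustrative assumptions, not details taken from the abstract.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Illustrative accrual detector: returns a suspicion level, not a verdict."""

    def __init__(self, window_size=100, min_std=0.01):
        self.intervals = deque(maxlen=window_size)  # recent inter-arrival times
        self.last_arrival = None
        self.min_std = min_std   # assumed floor so a perfectly regular link works

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level at time `now`; grows as the silence gets longer."""
        if not self.intervals:
            return 0.0           # no history yet, so no basis for suspicion
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        var = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(var), self.min_std)
        elapsed = now - self.last_arrival
        # Probability that the next heartbeat arrives later than `elapsed`
        # under the assumed normal model (upper tail of the distribution).
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)   # guard against log10(0) on underflow
        return -math.log10(p_later)
```

Because the mean and variance are re-estimated from recent heartbeats, the same elapsed silence maps to a different suspicion level on a jittery link than on a steady one. Each application can then compare `phi(now)` against its own threshold, which is where the decoupling between monitoring and application requirements comes from.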