Abstract. The concept of unreliable failure detectors for reliable distributed systems was introduced by Chandra and Toueg as a fine-grained means to add weak forms of synchrony into asynchronous systems. Various kinds of such failure detectors have been identified as each being the weakest to solve some specific distributed programming problem. In this paper, we provide a fresh look at failure detectors from the point of view of programming languages, more precisely using the formal tool of operational semantics. Inspired by this, we propose a new failure detector model that we consider easier to understand, easier to work with and more natural. Using operational semantics, we prove formally that representations of failure detectors in the new model are equivalent to their original representations within the model used by Chandra and Toueg.
Executive Summary
Background. In the field of Distributed Algorithms, a widely-used computation model is based on asynchronous communication between a fixed number n of connected processes, where no timing assumptions can be made. Moreover, processes are subject to crash-failure: once crashed, they do not recover. The concept of unreliable failure detectors was introduced by Chandra and Toueg [CT96] as a fine-grained means to add weak forms of synchrony into asynchronous systems. Various kinds of such failure detectors have been identified as each being the weakest to solve some specific distributed programming problem [CHT96].
The two communities of Distributed Algorithms and Programming Languages do not always speak the same "language". In fact, it is often not easy to understand each other's terminology, concepts, and hidden assumptions. Thus, in this paper, we provide a fresh look at the concept of failure detectors from the point of view of programming languages, using the formal tool of operational semantics. This paper complements previous work [NFM03] in which we used an operational semantics for a distributed process calculus to formally prove that a particular algorithm (also presented in [CT96]) solves the Distributed Consensus problem. Readers who are interested in proofs about algorithms within our new model (rather than proofs about it) are thus referred to our previous paper.