Abstract. We provide a suite of impossibility results and lower bounds for the required number of processes and rounds for synchronous consensus under transient link failures. Our results show that consensus can be solved even in the presence of O(n^2) moving omission and/or arbitrary link failures per round, provided that the numbers of affected outgoing and incoming links of every process are both bounded. Providing a step further towards the weakest conditions under which consensus is solvable, our findings are applicable to a variety of dynamic phenomena such as transient communication failures and end-to-end delay variations. We also prove that our model surpasses alternative link failure modeling approaches in terms of assumption coverage.
Key words. Fault-tolerant distributed algorithms, consensus, transient link failures, impossibility results, lower bounds, assumption coverage analysis.
AMS subject classifications. 68Q25, 68Q85, 68W15, 68W40
1. Introduction. Most research on fault-tolerant distributed algorithms conducted in the past rests on process failure models. Every failure occurring in a system is attributed to either the sending or receiving process here, irrespective of whether the actual error occurs at this process or rather on the intermediate communication path. Moreover, a process that commits a single failure is often "statically" considered faulty during the whole execution, even if its failure is transient.
Although such process failure models adequately capture many important scenarios, including crash failures where a faulty process just stops operating, and Byzantine failures where a faulty process can do anything, they are not particularly suitable for modeling more dynamic phenomena.
In particular, given the steadily increasing dominance of communication over computation in modern distributed systems, in conjunction with the high reliability of modern processors and robust operating system designs, transient communication failures such as lost or non-recognized packets (synchronization errors), CRC errors (data corruption), and receiver overruns (packet buffer overflow) are increasingly dominating real-world failures. Another dynamic phenomenon that is encountered frequently in practice is unpredictable variations of the end-to-end delays in multi-hop networks such as the Internet, which are caused, for example, by temporary network congestion and intermediate router failures. Since excessive end-to-end delays appear as omissions in classic (semi-)synchronous systems and other time(out)-based approaches, for example, [3-5, 7, 39, 43, 51, 53], such timing variations can also be considered as transient link failures.
The distinguishing properties of such failures are (a) that they affect the path (termed link in the sequel) connecting two processes, rather than the endpoints (the processes), and (b) that they are mobile [58], as different links may fail at different times. Hence, the ability to communicate [in a timely manner] with other processes in the system cannot be statically attribute...
We introduce a comprehensive hybrid failure model for synchronous distributed systems, which extends a conventional hybrid process failure model by adding communication failures: Every process in the system is allowed to commit up to f_ℓ^s send link failures and experience up to f_ℓ^r receive link failures per round here, without being considered faulty; up to some f_ℓ^sa ≤ f_ℓ^s and f_ℓ^ra ≤ f_ℓ^r among those may even cause erroneous messages rather than just omissions. In a companion paper (Schmid et al. (2009) [14]), devoted to a complete suite of related impossibility results and lower bounds, we proved that this model surpasses all existing link failure modeling approaches in terms of the assumption coverage in a simple probabilistic setting.
In this paper, we show that several well-known synchronous consensus algorithms can be adapted to work under our failure model, provided that the number of processes required for tolerating process failures is increased by small integer multiples of f_ℓ^s, f_ℓ^r, f_ℓ^sa, f_ℓ^ra. This is somewhat surprising, given that consensus in the presence of unrestricted link failures and mobile (moving) process omission failures is impossible. We provide detailed formulas for the required number of processes and rounds, which reveal that the lower bounds established in our companion paper are tight. We also explore the power and limitations of authentication in our setting, and consider uniform consensus algorithms, which guarantee their properties also for benign faulty processes.
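The per-round link failure bounds described above can be illustrated with a small admissibility check. The following is a minimal sketch only, not the authors' formal definition: it ignores process failures entirely, assumes that each faulty link counts against both its sender and its receiver, and all names (`Bounds`, `round_is_admissible`, the field names `f_s`, `f_r`, `f_sa`, `f_ra`) are hypothetical and chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass

OMISSION = "omission"    # message lost on the link
ARBITRARY = "arbitrary"  # message delivered with erroneous content


@dataclass(frozen=True)
class Bounds:
    """Hypothetical per-process, per-round link failure bounds."""
    f_s: int   # max send link failures (omission or arbitrary)
    f_r: int   # max receive link failures (omission or arbitrary)
    f_sa: int  # max arbitrary send link failures, f_sa <= f_s
    f_ra: int  # max arbitrary receive link failures, f_ra <= f_r


def round_is_admissible(failures, bounds):
    """Check one round's failure pattern against the bounds.

    failures maps (sender, receiver) -> OMISSION or ARBITRARY.
    Simplifying assumption: every faulty link is charged to both endpoints.
    """
    if bounds.f_sa > bounds.f_s or bounds.f_ra > bounds.f_r:
        raise ValueError("arbitrary bounds must not exceed total bounds")

    send_total, recv_total = Counter(), Counter()
    send_arb, recv_arb = Counter(), Counter()
    for (sender, receiver), kind in failures.items():
        send_total[sender] += 1
        recv_total[receiver] += 1
        if kind == ARBITRARY:
            send_arb[sender] += 1
            recv_arb[receiver] += 1

    return (
        all(c <= bounds.f_s for c in send_total.values())
        and all(c <= bounds.f_r for c in recv_total.values())
        and all(c <= bounds.f_sa for c in send_arb.values())
        and all(c <= bounds.f_ra for c in recv_arb.values())
    )


if __name__ == "__main__":
    n = 4
    bounds = Bounds(f_s=1, f_r=1, f_sa=0, f_ra=0)
    # Every process loses the link to its successor: n faulty links in
    # total, yet each process sees only one send and one receive failure.
    ring = {(p, (p + 1) % n): OMISSION for p in range(n)}
    print(round_is_admissible(ring, bounds))
```

In this sketch the system-wide number of faulty links in a round is limited only by the per-process bounds, so it can grow with the number of processes while no single process is considered faulty. Because the pattern is checked per round, a different set of links may fail in each round.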
We propose a fault-tolerant algorithm for synchronizing both state and rate of clocks in a distributed system. This algorithm is based on rounds, uses our fault-tolerant Optimal Precision (OP) convergence function as the means of synchronization, and maintains a collection of intervals to keep track of real-time, internal global time, and clock rates. The analysis shows that the interlocking between state and rate synchronization can be easily solved, and that oscillator stabilities together with the transmission delay uncertainties of packets predominate the internal synchronization. In addition, average case results gathered from simulation experiments with our SimUTC toolkit prove to be about one order of magnitude better than the worst case ones from the analysis of our state&rate algorithm.