Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2015
DOI: 10.1145/2807591.2807665

Practical scalable consensus for pseudo-synchronous distributed systems

Abstract: The ability to consistently handle faults in a distributed environment requires, among a small set of basic routines, an agreement algorithm allowing surviving entities to reach a consensual decision between a bounded set of volatile resources. This paper presents an algorithm that implements an Early Returning Agreement (ERA) in pseudo-synchronous systems, which optimistically allows a process to resume its activity while guaranteeing strong progress. We prove the correctness of our ERA algorithm, and expose …

Citations: cited by 17 publications (19 citation statements)
References: 36 publications
“…The last experiment (right in Figure 7) presents the performance of the agreement algorithm after failures have been injected. The authors of [14] presented a similar performance result for their agreement algorithm. In their results, the agreement performance was severely impacted when failures were discovered during the agreement (with the failure-free performance of 80 µs increasing to approximately 80 ms), an effect the authors claim is due to failure detection overhead.…”
Section: Failure Detection Time (mentioning)
confidence: 72%
“…Critical fault-tolerant algorithms for HPC, and implementations of communication middleware for unreliable systems rely on the strong properties of perfect failure detectors (see e.g. [9], [14], [5], [6], [19]). Their cost, in terms of computation and communication overhead, as well as their properties in terms of latency to detect and notify failures and of reliability, have thus a significant impact on the overall performance of a fault-tolerant HPC solution.…”
Section: Introduction (mentioning)
confidence: 99%
“…A consensus protocol to build fault-tolerant HPC applications which proposes an agreement algorithm implemented within the ULFM API is proposed in [26]. The algorithm assumes the fail-stop model.…”
Section: Related Work (mentioning)
confidence: 99%
“…Our group system allows not only to deal with faults, once it is built on top of ULFM, but also with performance issues; the purpose is to keep a group of processes that have a high probability of presenting good performance. Checkpoint-restart, ABFT, and state-machine replication strategies can all be applied on top of the group of recommended processes, that is, our …
[Table text from the citing paper: [25], User Level Failure Mitigation (ULFM) [9], Consensus Protocol [26][27][28], Adaptive MPI (AMPI) [37]: primitives for dealing with fault tolerance at the application level. Fenix [31,32]: checkpoint-restart at the application level. Dealing with process faults using ABFT [30]: Algorithm-Based Fault Tolerance (ABFT).]…”
Section: Related Work (mentioning)
confidence: 99%