Practical scalable consensus for pseudo-synchronous distributed systems

Hérault, Thomas; Bouteiller, Aurélien; Bosilca, George; Gamell, Marc; Teranishi, Keita; Parashar, Manish; Dongarra, Jack

doi:10.1145/2807591.2807665

Cited by 17 publications

(19 citation statements)

References 36 publications

“…The last experiment (right in Figure 7) presents the performance of the agreement algorithm after failures have been injected. The authors of [14] presented a similar performance result for their agreement algorithm. In their results, the agreement performance was severely impacted when failure were discovered during the agreement (with the failure free performance of 80µs increasing to approximatively 80ms), an effect the authors claim is due to failure detection overhead.…”

Section: Failure Detection Time (mentioning)

confidence: 72%

“…Critical fault-tolerant algorithms for HPC, and implementations of communication middleware for unreliable systems rely on the strong properties of perfect failure detectors (see e.g. [9], [14], [5], [6], [19]). Their cost, in terms of computation and communication overhead, as well as their properties in terms of latency to detect and notify failures and of reliability, have thus a significant impact on the overall performance of a fault-tolerant HPC solution.…”

Section: Introduction (mentioning)

confidence: 99%

Failure Detection and Propagation in HPC systems

Bosilca, Bouteiller, Guermouche et al. 2016

SC16: International Conference for High Performance Computing, Networking, Storage and Analysis

Self Cite

Building an infrastructure for Exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This paper describes the design and evaluation of a robust failure detector, able to maintain and distribute the correct list of alive resources within proven and scalable bounds. The detection and distribution of the fault information follow different overlay topologies that together guarantee minimal disturbance to the applications. A virtual observation ring minimizes the overhead by allowing each node to be observed by another single node, providing an unobtrusive behavior. The propagation stage is using a non-uniform variant of a reliable broadcast over a circulant graph overlay network, and guarantees a logarithmic fault propagation. Extensive simulations, together with experiments on the Titan ORNL supercomputer, show that the algorithm performs extremely well, and exhibits all the desired properties of an Exascale-ready algorithm.

“…A consensus protocol to build fault-tolerant HPC applications which proposes an agreement algorithm implemented within the ULFM API is proposed in [26]. The algorithm assumes the fail-stop model.…”

Section: Related Work (mentioning)

confidence: 99%

“…Our group system allows not only to deal with faults, once it is built on top of ULFM, but also with performance issues-the purpose is to keep a group of processes that have a high probability of presenting good performance. Checkpoint-restart, ABFT, and statemachine replication strategies can all be applied on top of the group of recommended processes, that is, our [25], User Level Failure Mitigation (ULFM) [9], Consensus Protocol [26][27][28], Adaptive MPI (AMPI) [37] Primitives for dealing with fault tolerance at the application level Fenix [31,32] Checkpoint-restart at the application level Dealing with process faults using ABFT [30] Algorithm-Based Fault Tolerance (ABFT)…”

Section: Related Work (mentioning)

confidence: 99%

Running resilient MPI applications on a Dynamic Group of Recommended Processes

Camargo, Duarte 2018

J Braz Comput Soc

High-performance computing systems run applications that can take several hours to execute and have to deal with the occurrence of a potentially large number of faults. Most of the existing fault-tolerant strategies for these systems assume crash faults that are permanent events are easily detected. This is not the case in several real systems, in particular in shared clusters, in which even the load variation may cause performance problems that are virtually equivalent to faults. In this work, we present a new model to deal with this problem in which processes execute tests among themselves in order to determine whether the processors (or cores) on which they are running are recommended or non-recommended. Processes classified as recommended form a Dynamic Group of Recommended Processes (DGRP) that runs the application. The DGRP is formed only by processes that have not been tested as non-recommended by all DGRP processes. A process not in the DGRP that is continuously tested as recommended can rejoin the DGRP after a round of consensus executed by DGRP processes. Experimental results are presented obtained from a MPI-based implementation in which the HyperQuickSort parallel sorting algorithm reconfigures itself at runtime to tolerate up to N − 1 faults (in a system with N processes) while sorting up to 1 billion integers.

Tree‐based fault‐tolerant collective operations for MPI

Margolin, Barak 2020

Concurrency and Computation

Summary With the increase in size and complexity of high‐performance computing systems, the probability of failures, and the cost of recovery grow. Parallel applications running on these systems should be able to continue running in spite of node failures at arbitrary times. Collective operations are essential for many parallel MPI applications, and are often the first to detect such failures. This work presents tree‐based fault‐tolerant collective operations, which combine fault detection and recovery as an integral part each operation. We do this by extending existing tree‐based algorithms, to allow for a collective operation to succeed despite failing nodes before or during its run. This differs from other approaches, where recovery takes place after a failure of such operations have failed. The article includes a comparison between the performance of the proposed algorithm and other approaches, as well as a simulator‐based analysis of performance at scale.
