SC16: International Conference for High Performance Computing, Networking, Storage and Analysis 2016
DOI: 10.1109/sc.2016.26

Failure Detection and Propagation in HPC systems

Abstract: Building an infrastructure for Exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This paper describes the design and evaluation of a robust failure detector, able to maintain and distribute the correct list of alive resources within proven and scalable bounds. The detection and distribution of the fault information follow different overlay topologies that together guarantee minimal disturbance to the applications. A virtual observation ring minim…

Citations: Cited by 15 publications (9 citation statements)
References: References 29 publications
“…A collection of works on ULFM [9,16-18,21,23,26] has investigated the applicability of ULFM and benchmarked individual operations of it. Bosilca et al [7,8] and Katti et al [19] propose efficient fault detection algorithms to integrate with ULFM. Teranishi et al [31] use spare processes to replace failed processes for local recovery so as to accelerate recovery of ULFM.…”
Section: Related Work (mentioning)
confidence: 99%
“…Algorithm‐based fault tolerance (ABFT) refers to algorithms which include fault detection or recovery. For example, matrix computation algorithms could recover from faults by way of “hot‐replacement.” 11 One common fault detection method is based on the communication timeouts, such as the logical ring topology proposed by Bosilca et al, 12 sending periodic keep‐alive messages in parallel with application execution. Fault‐aware MPI 13 represents another approach for applications to address faults by defining “transactions” which could be either committed or rolled‐back in the case of a fault, comparable to ULFM.…”
Section: Related Work (mentioning)
confidence: 99%
“…Within an MPI communication this can result in a deadlock due to open MPI requests. These failures are a main motivation behind the design of the ULFM extension [7]. If a hard failure occurs it is not straight forward to continue the computation.…”
Section: A Faults and Failures (mentioning)
confidence: 99%
“…Current MPI implementations thus typically terminate (or deadlock) in such a situation. The most prominent proposal which suggests a suitable extension to the MPI standard currently is User-Level Failure Mitigation (ULFM) [7], [8]. It allows users to define a workaround for the node loss scenario, e.g.…”
Section: Introduction (mentioning)
confidence: 99%
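
Illustration: one of the citation statements above describes timeout-based fault detection over a logical ring with periodic keep-alive messages. The following is a minimal sketch of that general idea as a discrete-time simulation, not the algorithm from the paper. The function name `simulate_ring`, the parameters `period`, `timeout`, and `horizon`, and the assumption that detected failures become known to all observers instantly are all hypothetical simplifications introduced here.

```python
from typing import Dict


def simulate_ring(
    num_procs: int,
    crash_time: Dict[int, int],
    period: int = 1,
    timeout: int = 3,
    horizon: int = 20,
) -> Dict[int, int]:
    """Simulate keep-alive messages on a ring.

    Each rank observes its ring predecessor. A rank listed in crash_time
    stops sending keep-alives from that tick onward. Returns a mapping
    {failed rank: tick at which its observer suspected it}.
    """
    # watched[i] is the rank that process i currently observes.
    watched = {i: (i - 1) % num_procs for i in range(num_procs)}
    # Tick at which each observer last received a keep-alive.
    last_heard = {i: 0 for i in range(num_procs)}
    detected: Dict[int, int] = {}

    def is_alive(rank: int, t: int) -> bool:
        return rank not in crash_time or t < crash_time[rank]

    for t in range(1, horizon + 1):
        for observer in range(num_procs):
            if not is_alive(observer, t):
                continue  # a crashed process observes nothing
            target = watched[observer]
            # A live target emits a keep-alive every `period` ticks.
            if is_alive(target, t) and t % period == 0:
                last_heard[observer] = t
            # Suspect the target once it has been silent longer than `timeout`.
            if t - last_heard[observer] > timeout:
                detected.setdefault(target, t)
                # Reconnect the ring: observe the nearest predecessor that
                # is not already known to have failed (assumes instant,
                # global knowledge of detections; a simplification).
                nxt = (target - 1) % num_procs
                while nxt in detected and nxt != observer:
                    nxt = (nxt - 1) % num_procs
                watched[observer] = nxt
                last_heard[observer] = t
    return detected


# Rank 2 crashes at tick 5; its observer (rank 3) last heard it at tick 4
# and suspects it once the silence exceeds the timeout of 3 ticks.
print(simulate_ring(5, {2: 5}))  # {2: 8}
```

In this sketch the detection delay is bounded by the timeout plus one keep-alive period, and each process exchanges messages with only one neighbor, which is the property that makes ring-style observation attractive at scale.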