2017
DOI: 10.1177/1094342017711505

A failure detector for HPC platforms

Abstract: Building an infrastructure for exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This paper describes the design and evaluation of a robust failure detector that can maintain and distribute the correct list of alive resources within proven and scalable bounds. The detection and distribution of the fault information follow different overlay topologies that together guarantee minimal disturbance to the applications. A virtual observation …

Citations: cited by 14 publications (15 citation statements)
References: 32 publications
“…However, in ULFM, application time grows significantly as the number of ranks increases. ULFM extends MPI with an always-on, periodic heartbeat mechanism [8] to detect failures and also modifies communication primitives for fault tolerant operation. Following from our measurements, those extensions noticeably increase the original application execution time.…”
Section: Discussion (mentioning)
confidence: 99%
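
The periodic heartbeat idea mentioned in the statement above can be illustrated with a minimal sketch. This is not ULFM's implementation or the detector from the paper: the class `HeartbeatDetector`, the function `simulate`, and all parameter values are hypothetical, and the sketch uses a simulated logical clock and a single observer rather than real MPI communication.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Timeout-based detector (hypothetical): a rank is suspected when
    no heartbeat from it has been recorded for longer than `timeout`."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, rank: int, now: float) -> None:
        # Record the most recent time a heartbeat arrived from `rank`.
        self.last_seen[rank] = now

    def suspected(self, now: float) -> set:
        # Ranks whose last heartbeat is older than the timeout.
        return {r for r, t in self.last_seen.items() if now - t > self.timeout}


def simulate(num_ranks=4, period=1.0, timeout=2.5,
             fail_rank=2, fail_time=3.0, steps=8):
    """Drive the detector with a logical clock (assumed values throughout).

    Every `period`, each live rank emits a heartbeat; `fail_rank` stops
    emitting from `fail_time` onward. Returns the first time at which any
    rank is suspected, together with the suspected set.
    """
    detector = HeartbeatDetector(timeout=timeout)
    for step in range(steps + 1):
        now = step * period
        for rank in range(num_ranks):
            alive = rank != fail_rank or now < fail_time
            if alive:
                detector.heartbeat(rank, now)
        suspects = detector.suspected(now)
        if suspects:
            return now, suspects
    return None, set()


if __name__ == "__main__":
    detected_at, suspects = simulate()
    print(f"suspected {sorted(suspects)} at t={detected_at}")
```

In this toy setup the detection delay is governed by the heartbeat period and the timeout, and heartbeats are emitted whether or not a failure ever occurs, which is the always-on cost that the citing authors attribute to ULFM.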
“…In the failure detection stage, stop & restart techniques cause the entire application to abort when one or several processes fail, while the ULFM resilience constructs enable failure notification to some or all the remaining live processes without global cancellation of the application. Besides, the existence of a well-defined propagation mechanism (i.e., communication revocation), exposed through the ULFM API, allows for highly optimized implementations, as proposed in [8]. Such implementations take advantage of underlying MPI capabilities and the structure of applications to improve the speed at which process faults are detected and to deliver a fast and reliable multicast using the same … [53] simulates the main procedures in a 3D method of characteristics (MOC) code for the numerical solution of the steady-state neutron transport equation.…”
Section: Resilient Vs Stop and Restart Solutions (mentioning)
confidence: 99%
“…It can introduce memory access and communication latency to the application execution and further affect the application execution efficiency. As reported in a ULFM paper [24], ULFM implements a constantly heartbeat mechanism for failures detection, and also amends MPI communication interfaces for failure recovery operations. These changes must have an impact on application execution … [figure: bars for RESTART-FTI, REINIT-FTI, and ULFM-FTI; axis 0 to 300] … Furthermore, we observe that the times for writing checkpoints in RESTART-FTI and REINIT-FTI cases are close.…”
Section: Performance Comparison On Different Scaling Sizes (mentioning)
confidence: 99%