Sliding Substitution of Failed Nodes

Hori, Atsushi; Yoshinaga, Kazumi; Hérault, Thomas; Bouteiller, Aurélien; Bosilca, George; Ishikawa, Yutaka

doi:10.1145/2802658.2802670

Cited by 6 publications

(6 citation statements)

References 15 publications

(17 reference statements)


“…Hori et al 27 discussed how one can use spare nodes to restart an application which has experienced a node failure. This technique could be applied to our case.…”

Section: Related Work (mentioning)

confidence: 99%

Improving batch schedulers with node stealing for failed jobs

Du, Marchal, Pallez et al. 2024

Concurrency and Computation


Summary: After a machine failure, batch schedulers typically re-schedule the job that failed with a high priority. This is fair for the failed job but still requires that job to re-enter the submission queue and to wait for enough resources to become available. The waiting time can be very long when the job is large and the platform highly loaded, as is the case with typical HPC platforms. We propose another strategy: when a job fails, if no platform node is available, we steal one node from another job and use it to continue the execution of the failed job despite the failure. In this work, we give a detailed assessment of this node stealing strategy using traces from the Mira supercomputer at Argonne National Laboratory. The main conclusion is that node stealing improves the utilization of the platform and dramatically reduces the flow of large jobs, at the price of slightly increasing the flow of small jobs.


“…The application developer is responsible for implementing recovery using those operations, choosing the type of recovery best suited for its application. A collection of works on ULFM [9,[16][17][18]21,23,26] has investigated the applicability of ULFM and benchmarked individual operations of it. Bosilca et al [7,8] and Katti et al [19] propose efficient fault detection algorithms to integrate with ULFM.…”

Section: Related Work (mentioning)

confidence: 99%

“…Although there has been a large bibliography [4,5,9,11,[16][17][18][21][22][23]26] discussing the programming model and prototypes of those approaches, no study has presented an in-depth performance evaluation of them -most previous works either focus on individual aspects of each approach or perform limited scale experiments. In this paper, we present an extensive evaluation using HPC proxy applications to contrast these two leading global-restart recovery approaches.…”

Section: Introduction (mentioning)

confidence: 99%

Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance

Georgakoudis, Guo, Laguna 2020

Lecture Notes in Computer Science


Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, redeploying an application incurs overhead by tearing down and reinstating execution, and possibly limiting checkpointing retrieval from slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive new insight on performance. Experimentation with three different HPC proxy applications made resilient to withstand process and node failures shows that Reinit++ recovers much faster than restarting, up to 6×, or ULFM, up to 3×, and that it scales excellently as the number of MPI processes grows.


“…In theory, five message collisions, for example, means the communication time gets slower five times. On the K computer, only three times slower communication time was observed because simultaneous four message sending takes 1.7 times of the time of sending one message ( 3 ≈ 5 / 1.7 ) (Hori et al, 2015). One possible reason to explain this slowness (1.7 with 5P-stencil and 3.7 with 7P-stencil) is the insufficient bandwidth between the memory and the network controller chip.…”

Section: Evaluations on K, BG/Q, and TSUBAME 2.5 (mentioning)

confidence: 99%
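
The arithmetic in the quoted statement above can be written out as a short worked equation. This only restates the quoted figures, assuming that the baseline stencil step already costs 1.7 single-message times and that five colliding messages are fully serialized. The symbols t, T_base, and T_collide are introduced here for illustration and do not come from the cited article.

```latex
% t         : time to send one message (illustrative symbol)
% T_base    : four simultaneous sends, measured at 1.7 t on the K computer
% T_collide : five colliding messages, assumed fully serialized
\[
  \frac{T_{\text{collide}}}{T_{\text{base}}}
  = \frac{5\,t}{1.7\,t}
  \approx 2.94 \approx 3
\]
```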

Overhead of using spare nodes

Hori, Yoshinaga, Hérault et al. 2020

The International Journal of High Performance Computing Applica

Self Cite


With the increasing fault rate on high-end supercomputers, the topic of fault tolerance has been gathering attention. To cope with this situation, various fault-tolerance techniques are under investigation; these include user-level, algorithm-based fault-tolerance techniques and parallel execution environments that enable jobs to continue following node failure. Even with these techniques, some programs with static load balancing, such as stencil computation, may underperform after a failure recovery. Even when spare nodes are present, they are not always substituted for failed nodes in an effective way. This article considers the questions of how spare nodes should be allocated, how to substitute them for faulty nodes, and how much the communication performance is affected by such a substitution. The third question stems from the modification of the rank mapping by node substitutions, which can incur additional message collisions. In a stencil computation, rank mapping is done in a straightforward way on a Cartesian network without incurring any message collisions. However, once a substitution has occurred, the optimal node-rank mapping may be destroyed. Therefore, these questions must be answered in a way that minimizes the degradation of communication performance. In this article, several spare node allocation and failed node substitution methods will be proposed, analyzed, and compared in terms of communication performance following the substitution. The proposed substitution methods are named sliding methods. The sliding methods are analyzed by using our developed simulation program and evaluated by using the K computer, Blue Gene/Q (BG/Q), and TSUBAME 2.5. It will be shown that when failures occur, the stencil communication performance on the K and BG/Q can be slowed around 10 times depending on the number of node failures. The barrier performance on the K can be cut in half. On BG/Q, barrier performance can be slowed by a factor of 10. 
Further, it will also be shown that almost no such communication performance degradation can be seen on TSUBAME 2.5. This is because TSUBAME 2.5 has an Infiniband network connected with a FatTree topology, while the K computer and BG/Q have dedicated Cartesian networks. Thus, the communication performance degradation depends on network characteristics.
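
The abstract above names the sliding methods but does not spell out their mechanics. The following is a minimal one-dimensional sketch of one plausible reading, not the authors' algorithm: ranks sit on a row of nodes with a single spare at the end, and on a failure every rank at or beyond the failed node shifts one node toward the spare. The function names and the neighbor-distance metric are hypothetical and chosen only for illustration.

```python
from typing import List


def substitute_direct(mapping: List[int], failed: int, spare: int) -> List[int]:
    """Move only the rank on the failed node to the spare; others stay put."""
    return [spare if node == failed else node for node in mapping]


def substitute_sliding(mapping: List[int], failed: int, spare: int) -> List[int]:
    """Shift every rank at or beyond the failed node one step toward the spare.

    Assumption (illustrative): nodes are numbered along one row and the
    spare is the node immediately after the last node in use.
    """
    if failed not in mapping:
        raise ValueError("failed node hosts no rank")
    if spare != max(mapping) + 1:
        raise ValueError("sketch assumes the spare directly follows the row")
    return [node + 1 if node >= failed else node for node in mapping]


def max_neighbor_distance(mapping: List[int]) -> int:
    """Largest node distance between ranks that are logical neighbors."""
    return max(abs(b - a) for a, b in zip(mapping, mapping[1:]))


if __name__ == "__main__":
    ranks_to_nodes = list(range(6))  # rank i runs on node i; node 6 is the spare
    direct = substitute_direct(ranks_to_nodes, failed=2, spare=6)
    sliding = substitute_sliding(ranks_to_nodes, failed=2, spare=6)
    print("direct :", direct, "max distance", max_neighbor_distance(direct))
    print("sliding:", sliding, "max distance", max_neighbor_distance(sliding))
```

In this toy setting, direct substitution stretches one neighbor pair across most of the row, while sliding keeps every pair within two nodes of each other. That is the kind of difference in rank mapping the abstract associates with message collisions on Cartesian networks.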
