Overhead of using spare nodes

Hori, Atsushi; Yoshinaga, Kazumi; Hérault, Thomas; Bouteiller, Aurélien; Bosilca, George; Ishikawa, Yutaka

doi:10.1177/1094342020901885

Cited by 5 publications (3 citation statements)

References 24 publications (23 reference statements)


“…[22]), but this is beyond the scope of this paper. More generally speaking, depending on how exactly CR is implemented, it could make use of spare nodes, enabling the application to keep using most of the nodes already allocated to it, but making it necessary to identify the identity of the lost node just as in the case of ESR; or the whole application could be restarted on newly-allocated nodes, although this is likely to be more costly than identifying the lost nodes, particularly at greater scales [6,12,26].…”

Section: Beyond Node-failure Simulation (mentioning)

confidence: 99%


Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method

Pachajoa, Pacher, Levonyak, et al. 2020

49th International Conference on Parallel Processing - ICPP


As computers reach exascale and beyond, the incidence of faults will increase. Solutions to this problem are an active research topic. We focus on strategies to make the preconditioned conjugate gradient (PCG) solver resilient against node failures, specifically, the exact state reconstruction (ESR) method, which exploits redundancies in PCG. Reducing the frequency at which redundant information is stored lessens the runtime overhead. However, after the node failure, the solver must restart from the last iteration for which redundant information was stored, which increases recovery overhead. This formulation highlights the method's similarities to checkpoint-restart (CR). Thus, this method, which we call ESR with periodic storage (ESRP), can be considered a form of algorithm-based checkpoint-restart. The state is stored implicitly, by exploiting redundancy inherent to the algorithm, rather than explicitly as in CR. We also minimize the amount of data to be stored and retrieved compared to CR, but additional computation is required to reconstruct the solver's state. In this paper, we describe the necessary modifications to ESR to convert it into ESRP, and perform an experimental evaluation. We compare ESRP experimentally with previously-existing ESR and application-level in-memory CR. Our results confirm that the overhead for ESR is reduced significantly, both in the failure-free case, and if node failures are introduced. In the former case, the overhead of ESRP is usually lower than that of CR. However, CR is faster if node failures happen. We claim that these differences can be alleviated by the implementation of more appropriate preconditioners.



“…The work mentioned so far supposes the availability of spare nodes. In [12], Hori et al. propose strategies for the allocation of these spare nodes, and the replacement of lost nodes, when runtime performance is of consideration.…”

Section: Related Work (mentioning)

confidence: 99%

Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method

Pachajoa, Pacher, Levonyak, et al. 2020

49th International Conference on Parallel Processing - ICPP


“…However, it is often the case that some resources are not used while there are jobs in the queue, since the resource requirements of the waiting jobs are greater than the available resources. In addition, some jobs in progress do not use all their resources efficiently during execution, for example because they do not use all of their resources during the entire execution, or because they have spare nodes for fault tolerance [3].…”

Section: Introduction (mentioning)

confidence: 99%

Proteo: A Framework for the Generation and Evaluation of Malleable MPI Applications

Martín-Álvarez, Aliaga, Castillo, et al. 2024

Preprint


Applying malleability to HPC systems can increase their productivity without degrading or even improving the performance of running applications. This paper presents Proteo, a configurable framework that allows to design benchmarks to study the effect of malleability on a system, and also incorporates malleability into a real application. Proteo consists of two modules: SAM allows to emulate the computational behavior of iterative scientific MPI applications; and MaM is able to reconfigure a job during execution, adjusting the number of processes, redistributing data and resuming execution. An in-depth study of all the possibilities shows that Proteo is able to behave like a real malleable or non-malleable application in the range [0.85, 1.15]. Furthermore, the different methods defined in MaM for process management and data redistribution are analyzed, concluding that asynchronous malleability, where reconfiguration and application execution overlap, results in a 1.15x speedup.


Proteo: a framework for the generation and evaluation of malleable MPI applications

Martín-Álvarez, Aliaga, Castillo, et al. 2024

J Supercomput


Applying malleability to HPC systems can increase their productivity without degrading or even improving the performance of running applications. This paper presents Proteo, a configurable framework that allows to design benchmarks to study the effect of malleability on a system, and also incorporates malleability into a real application. Proteo consists of two modules: SAM allows to emulate the computational behavior of iterative scientific MPI applications, and MaM is able to reconfigure a job during execution, adjusting the number of processes, redistributing data, and resuming execution. An in-depth study of all the possibilities shows that Proteo is able to behave like a real malleable or non-malleable application in the range [0.85, 1.15]. Furthermore, the different methods defined in MaM for process management and data redistribution are analyzed, concluding that asynchronous malleability, where reconfiguration and application execution overlap, results in a 1.15× speedup.
