Abstract: With the increasing fault rate on high-end supercomputers, the topic of fault tolerance has been gathering attention. To cope with this situation, various fault-tolerance techniques are under investigation; these include user-level, algorithm-based fault-tolerance techniques and parallel execution environments that enable jobs to continue following node failure. Even with these techniques, some programs with static load balancing, such as stencil computation, may underperform after a failure recovery. Even whe…
“…[22]), but this is beyond the scope of this paper. More generally speaking, depending on how exactly CR is implemented, it could make use of spare nodes, enabling the application to keep using most of the nodes already allocated to it, but making it necessary to identify the lost node just as in the case of ESR; or the whole application could be restarted on newly-allocated nodes, although this is likely to be more costly than identifying the lost nodes, particularly at greater scales [6,12,26].…”
Section: Beyond Node-failure Simulation (mentioning)
confidence: 99%
“…The work mentioned so far supposes the availability of spare nodes. In [12], Hori et al. propose strategies for the allocation of these spare nodes, and the replacement of lost nodes, when runtime performance is a consideration.…”
As computers reach exascale and beyond, the incidence of faults will increase. Solutions to this problem are an active research topic. We focus on strategies to make the preconditioned conjugate gradient (PCG) solver resilient against node failures, specifically, the exact state reconstruction (ESR) method, which exploits redundancies in PCG. Reducing the frequency at which redundant information is stored lessens the runtime overhead. However, after the node failure, the solver must restart from the last iteration for which redundant information was stored, which increases recovery overhead. This formulation highlights the method's similarities to checkpoint-restart (CR). Thus, this method, which we call ESR with periodic storage (ESRP), can be considered a form of algorithm-based checkpoint-restart. The state is stored implicitly, by exploiting redundancy inherent to the algorithm, rather than explicitly as in CR. We also minimize the amount of data to be stored and retrieved compared to CR, but additional computation is required to reconstruct the solver's state. In this paper, we describe the necessary modifications to ESR to convert it into ESRP, and perform an experimental evaluation. We compare ESRP experimentally with previously-existing ESR and application-level in-memory CR. Our results confirm that the overhead for ESR is reduced significantly, both in the failure-free case, and if node failures are introduced. In the former case, the overhead of ESRP is usually lower than that of CR. However, CR is faster if node failures happen. We claim that these differences can be alleviated by the implementation of more appropriate preconditioners.
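The trade-off described in this abstract (storing redundant information less often lowers runtime overhead but lengthens the rollback after a failure) can be illustrated with a toy model. This is only a sketch under stated assumptions, not the authors' implementation: it assumes redundant information is stored at iteration 0 and at every multiple of a fixed interval, and the names `rollback_iteration`, `iterations_to_redo`, and `storage_steps` are hypothetical.

```python
def rollback_iteration(failure_iter: int, storage_interval: int) -> int:
    """Last iteration at or before the failure with stored redundant data.

    Assumption (not from the source): data is stored at iteration 0 and at
    every multiple of `storage_interval`.
    """
    if storage_interval < 1:
        raise ValueError("storage_interval must be at least 1")
    if failure_iter < 0:
        raise ValueError("failure_iter must be non-negative")
    return (failure_iter // storage_interval) * storage_interval


def iterations_to_redo(failure_iter: int, storage_interval: int) -> int:
    """Iterations that must be recomputed after restarting from the rollback point."""
    return failure_iter - rollback_iteration(failure_iter, storage_interval)


def storage_steps(total_iters: int, storage_interval: int) -> int:
    """Number of iterations in 1..total_iters at which data is stored."""
    if storage_interval < 1:
        raise ValueError("storage_interval must be at least 1")
    return total_iters // storage_interval


if __name__ == "__main__":
    # Compare storing every iteration against two periodic-storage intervals.
    for interval in (1, 10, 50):
        print(
            f"interval={interval:3d}  "
            f"storage steps in 100 iterations={storage_steps(100, interval):3d}  "
            f"iterations redone after failure at 47={iterations_to_redo(47, interval):2d}"
        )
```

In this model, an interval of 1 corresponds to storing at every iteration, so nothing is recomputed after a failure, while larger intervals trade fewer storage steps for more recomputed iterations.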
“…However, it is often the case that some resources are not used while there are jobs in the queue, since the resource requirements of the waiting jobs are greater than the available resources. In addition, some jobs in progress do not use all their resources efficiently during execution, for example because they do not use all of their resources during the entire execution, or because they have spare nodes for fault tolerance [3].…”
Applying malleability to HPC systems can increase their productivity without degrading, and possibly even improving, the performance of running applications. This paper presents Proteo, a configurable framework that allows users to design benchmarks to study the effect of malleability on a system, and that also incorporates malleability into a real application. Proteo consists of two modules: SAM emulates the computational behavior of iterative scientific MPI applications, and MaM reconfigures a job during execution, adjusting the number of processes, redistributing data, and resuming execution. An in-depth study of all the possibilities shows that Proteo is able to behave like a real malleable or non-malleable application in the range [0.85, 1.15]. Furthermore, the different methods defined in MaM for process management and data redistribution are analyzed, concluding that asynchronous malleability, where reconfiguration and application execution overlap, results in a 1.15x speedup.