As reported by many recent studies, the mean time between failures of future post-petascale supercomputers is likely to reduce, compared to the current situation. The most popular fault tolerance approach for MPI applications on HPC Platforms relies on coordinated checkpointing which raises two major issues: a) global restart wastes energy since all processes are forced to rollback even in the case of a single failure; b) checkpoint coordination may slow down the application execution because of congestions on I/O resources. Alternative approaches based on uncoordinated checkpointing and message logging require logging all messages, imposing a high memory/storage occupation and a significant overhead on communications. It has recently been observed that many MPI HPC applications are send-deterministic, allowing to design new fault tolerance protocols. In this paper, we propose an uncoordinated checkpointing protocol for senddeterministic MPI HPC applications that (i) logs only a subset of the application messages and (ii) does not require to restart systematically all processes when a failure occurs. We first describe our protocol and prove its correctness. Through experimental evaluations, we show that its implementation in MPICH2 has a negligible overhead on application performance. Then we perform a quantitative evaluation of the properties of our protocol using the NAS Benchmarks. Using a clustering approach, we demonstrate that this protocol actually succeeds to combine the two expected properties: a) it logs only a small fraction of the messages and b) it reduces by a factor approaching 2 the average number of processes to rollback compared to coordinated checkpointing.
In this article, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation. Un modèle unifié pour l'évaluation des protocoles de checkpointà très largeéchelle Résumé :Nous présentons ici un modèle unifié de plusieurs protocoles de sauvegarde de points de reprise (checkpoints) et de redémarrage. Le modèle proposé est suffisamment générique pour contenir les deux extrêmes des techniques de checkpoint/restart, d'une approche coordonnéeà toute une famille de stratégies non-coordonnées (avec enregistrement de messages).Nous identifions un ensemble de paramètres cruciaux, les instancions et comparons l'espérance de l'efficacité des protocoles de tolérance aux pannes, pour un couple donné application/plate-forme. Nous proposons une analyse détaillée de plusieurs scénarios, incluant certaines des plates-formes de calcul existantes les plus puissantes, ainsi que des anticipations sur les futures plates-formes exascale. Les résultats de cette analyse sont corroborés par un ensemble de simulations. Ensemble, ces résultats illustrent le comportement relatif des différentes stratégiesà largeéchelle, fournissant des enseignements qu'il serait très difficile, voire impossible, d'obtenir par l'expérimentation directe.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.