Élisabeth Brunet scite author profile

Élisabeth Brunet

3 Publications

183 Citation Statements Received

94 Citation Statements Given

Affiliations

Institut Polytechnique de Paris, French Institute for Research in Computer Science and Automation, Laboratoire Bordelais de Recherche en Informatique

Publications

Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications

Guermouche, Ropars, Brunet et al. 2011

As reported by many recent studies, the mean time between failures of future post-petascale supercomputers is likely to decrease compared to the current situation. The most popular fault tolerance approach for MPI applications on HPC platforms relies on coordinated checkpointing, which raises two major issues: a) global restart wastes energy, since all processes are forced to roll back even in the case of a single failure; b) checkpoint coordination may slow down the application execution because of congestion on I/O resources. Alternative approaches based on uncoordinated checkpointing and message logging require logging all messages, imposing high memory/storage occupation and a significant overhead on communications. It has recently been observed that many MPI HPC applications are send-deterministic, which allows new fault tolerance protocols to be designed. In this paper, we propose an uncoordinated checkpointing protocol for send-deterministic MPI HPC applications that (i) logs only a subset of the application messages and (ii) does not require systematically restarting all processes when a failure occurs. We first describe our protocol and prove its correctness. Through experimental evaluations, we show that its implementation in MPICH2 has a negligible overhead on application performance. Then we perform a quantitative evaluation of the properties of our protocol using the NAS Benchmarks. Using a clustering approach, we demonstrate that this protocol actually succeeds in combining the two expected properties: a) it logs only a small fraction of the messages and b) it reduces by a factor approaching 2 the average number of processes to roll back compared to coordinated checkpointing.

Unified model for assessing checkpointing protocols at extreme‐scale

Bosilca, Bouteiller, Brunet et al. 2013

Concurrency and Computation

In this article, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault-tolerant protocols for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.

NEW MADELEINE: a Fast Communication Scheduling Engine for High Performance Networks

Aumage, Brunet, Furmento et al. 2007
