As reported by many recent studies, the mean time between failures of future post-petascale supercomputers is likely to reduce, compared to the current situation. The most popular fault tolerance approach for MPI applications on HPC Platforms relies on coordinated checkpointing which raises two major issues: a) global restart wastes energy since all processes are forced to rollback even in the case of a single failure; b) checkpoint coordination may slow down the application execution because of congestions on I/O resources. Alternative approaches based on uncoordinated checkpointing and message logging require logging all messages, imposing a high memory/storage occupation and a significant overhead on communications. It has recently been observed that many MPI HPC applications are send-deterministic, allowing to design new fault tolerance protocols. In this paper, we propose an uncoordinated checkpointing protocol for senddeterministic MPI HPC applications that (i) logs only a subset of the application messages and (ii) does not require to restart systematically all processes when a failure occurs. We first describe our protocol and prove its correctness. Through experimental evaluations, we show that its implementation in MPICH2 has a negligible overhead on application performance. Then we perform a quantitative evaluation of the properties of our protocol using the NAS Benchmarks. Using a clustering approach, we demonstrate that this protocol actually succeeds to combine the two expected properties: a) it logs only a small fraction of the messages and b) it reduces by a factor approaching 2 the average number of processes to rollback compared to coordinated checkpointing.
In this article, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation. Un modèle unifié pour l'évaluation des protocoles de checkpointà très largeéchelle Résumé :Nous présentons ici un modèle unifié de plusieurs protocoles de sauvegarde de points de reprise (checkpoints) et de redémarrage. Le modèle proposé est suffisamment générique pour contenir les deux extrêmes des techniques de checkpoint/restart, d'une approche coordonnéeà toute une famille de stratégies non-coordonnées (avec enregistrement de messages).Nous identifions un ensemble de paramètres cruciaux, les instancions et comparons l'espérance de l'efficacité des protocoles de tolérance aux pannes, pour un couple donné application/plate-forme. Nous proposons une analyse détaillée de plusieurs scénarios, incluant certaines des plates-formes de calcul existantes les plus puissantes, ainsi que des anticipations sur les futures plates-formes exascale. Les résultats de cette analyse sont corroborés par un ensemble de simulations. Ensemble, ces résultats illustrent le comportement relatif des différentes stratégiesà largeéchelle, fournissant des enseignements qu'il serait très difficile, voire impossible, d'obtenir par l'expérimentation directe.
The current trend in clusters leads towards an increase of the number of cores per node. As a result, an increasing number of parallel applications is mixing message passing and multithreading as an attempt to better match the underlying architecture's structure. This naturally raises the problem of designing efficient, multithreaded implementations of MPI. In this paper, we present the design of a multithreaded communication engine able to exploit idle cores to speed up communications in two ways: it can move CPUintensive operations out of the critical path (e.g. PIO transfers offload), and is able to let rendezvous transfers progress asynchronously. We have implemented these methods in the PM2 software suite, evaluated their behavior in typical cases, and we have observed good performance results in overlapping communication and computation. Classical non-multithreaded communication engines such as OpenMPI [3], MPICH2 [1] or MVAPICH2 [4] do not fully take benefit from multicore architectures: com
International audienceAlthough processors become massively multicore and therefore new programming models mix message passing and multi-threading, the effects of threads on communication libraries remain neglected. Designing an efficient modern communication library requires precautions in order to limit the impact of thread-safety mechanisms on performance. In this paper, we present various approaches to building a thread-safe communication library and we study their benefit and impact on performance. We also describe and evaluate techniques used to exploit idle cores to balance the communication library load across multicore machines
This paper describes how the NewMadeleine communication library has been integrated within the MPICH2 MPI implementation and the benefits brought. NewMadeleine is integrated as a Nemesis network module but the upper layers and in particular the CH3 layer has been modified. By doing so, we allow NewMadeleine to fully deliver its performance to an MPI application. NewMadeleine features sophisticated strategies for sending messages and natively supports multirail network configurations, even heterogeneous ones. It also uses a software element called PIOMan that uses multithreading in order to enhance reactivity and create more efficient progress engines. We show various results that prove that NewMadeleine is indeed well suited as a low-level communication library for building MPI implementations. ASReq AS Req AS Req AS Req Regular Req AS Req Regular Regular Regular Req Req Req
Over the past few years, the architecture of supercomputing platforms has evolved towards more complexity: multicore processors attached to multiple memory banks are now combined with accelerators. Exploiting such architecture often requires to mix programming models (MPI + CUDA for instance). As a result, understanding the performance of an application has become tedious. The use of performance analysis tools, such as tracing tools, now becomes unavoidable to optimize a parallel application. However, analyzing a trace file composed of millions of events requires a tremendous amount of work in order to spot the cause of the poor performance of an application.In this paper, we propose mechanisms for assisting application developers in their exploration of trace files. We propose an algorithm for detecting repetitive patterns of events in trace files. Thanks to this algorithm, a trace can be viewed as loops and groups of events instead of the usual representation as a sequential list of events. We also propose a method to filter traces in order to eliminate duplicated information and to highlight points of interest. These mechanisms allow the performance analysis tool to pre-select the subsets of the trace that are more likely to contain useful information. We implemented the proposed mechanism in the EZTrace performance analysis framework and the experiments show that detecting patterns in various benchmarking applications is done in reasonable time, even when the trace contains millions of events. We also show that the filtering process can reduce the quantity of information in the trace that the user has to analyze by up to 99 %.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.