Guillaume Pallez scite author profile

et al. 2020

In HPC platforms, concurrent applications share the same file system. This can lead to conflicts, especially as applications become more and more data intensive, and I/O contention can become a performance bottleneck. Access to bandwidth can be split into two complementary yet distinct problems: the mapping problem and the scheduling problem. The mapping problem consists in selecting the set of applications that compete for the I/O resource. The scheduling problem then consists, given I/O requests on the same resource, in determining the order of these accesses so as to minimize the I/O time. In this work we propose to couple a novel bandwidth-aware mapping algorithm with I/O list-scheduling policies to develop a cross-layer optimization solution. We study this solution experimentally using an I/O middleware, CLARISSE. We show that naive policies such as FIFO perform relatively well for scheduling I/O movements, and that most of the congestion reduction comes from the mapping part. We evaluate the algorithm that we propose using a simulator that we validated experimentally. This evaluation shows important gains for the simple, bandwidth-aware mapping solution that we provide compared to its non-bandwidth-aware counterpart. The gains are both in terms of machine efficiency (makespan) and application efficiency (stretch). This stresses even more the importance of designing efficient, bandwidth-aware mapping strategies to alleviate the cost of I/O congestion.

A New Framework for Evaluating Straggler Detection Mechanisms in MapReduce

Phan

ACM Trans. Model. Perform. Eval. Comput. Syst.

Ibrahim

et al. 2019

Big Data systems (e.g., Google MapReduce, Apache Hadoop, Apache Spark) rely increasingly on speculative execution to mask slow tasks, also known as stragglers, because a job's execution time is dominated by the slowest task instance. Big Data systems typically identify stragglers and speculatively run copies of those tasks with the expectation that a copy may complete faster and so shorten job execution times. There is a rich body of recent results on straggler mitigation in MapReduce. However, the majority of these do not consider the problem of accurately detecting stragglers. Instead, they adopt a particular straggler detection approach and then study its effectiveness in terms of performance, e.g., reduction in job completion time, or efficiency, e.g., high resource utilization. In this paper, we consider a complete framework for straggler detection and mitigation. We start with a set of metrics that can be used to characterize and detect stragglers, including Precision, Recall, Detection Latency, Undetected Time and Fake Positive. We then develop an architectural model by which these metrics can be linked to measures of performance, including execution time and system energy overheads. We further conduct a series of experiments to demonstrate which metrics and approaches are more effective in detecting stragglers and are also predictive of effectiveness in terms of performance and energy efficiency. For example, our results indicate that the default Hadoop straggler detector could be made more effective. In a certain case, Precision is low, with only 55% of the detected tasks being actual stragglers, and the Recall, i.e., the percentage of actual stragglers that are detected, is also relatively low at 56%. For the same case, the hierarchical approach (i.e., a green-driven detector based on the default one) achieves a Precision of 99% and a Recall of 29%.
This increase in Precision can be translated to achieve lower execution time and energy consumption, and thus higher performance and energy efficiency; compared to the default Hadoop mechanism, the energy consumption is reduced by almost 31%. These results demonstrate how our framework can offer useful insights and be applied in practical settings to characterize and design new straggler detection mechanisms for MapReduce systems. This work is supported by the ANR KerStream project (ANR-16-CE25-0014-01) and the Stack/Apollo connect talent project. The experiments presented in this paper were carried out using the Grid'5000/ALADDIN-G5K experimental testbed, an initiative from the French Ministry of Research through the ACI GRID incentive action, INRIA, CNRS and RENATER and other contributing partners (see http://www.grid5000.fr/ for details).
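As an illustration of the two headline metrics above, the following minimal Python sketch computes Precision and Recall from a set of detected tasks and a set of actual stragglers. The function name `precision_recall` and the task identifiers are hypothetical and chosen only for this example; they are not taken from the paper or from Hadoop.

```python
def precision_recall(detected, actual):
    """Return (precision, recall) for a straggler detector.

    detected: set of task ids the detector flagged as stragglers
    actual:   set of task ids that really were stragglers
    """
    true_positives = len(detected & actual)
    # Precision: fraction of flagged tasks that are real stragglers.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: fraction of real stragglers that were flagged.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical task ids, used only to exercise the function.
detected = {"t1", "t2", "t3", "t4"}
actual = {"t2", "t3", "t5", "t6", "t7"}
precision, recall = precision_recall(detected, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A detector can trade one metric for the other: flagging fewer tasks tends to raise Precision and lower Recall, which matches the pattern reported for the hierarchical approach.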

Optimal memory-aware backpropagation of deep join networks

Beaumont

Herrmann

Phil. Trans. R. Soc. A.

et al. 2020

The memory needs of deep learning training can prevent the user from considering large models and large batch sizes. In this work, we propose to use techniques from memory-aware scheduling and automatic differentiation (AD) to execute a backpropagation graph with a bounded memory requirement at the cost of extra recomputations. The case of a single homogeneous chain, i.e. the case of a network whose stages are all identical and form a chain, is well understood and optimal solutions have been proposed in the AD literature. The networks encountered in practice in the context of deep learning are much more diverse, both in terms of shape and heterogeneity. In this work, we define the class of backpropagation graphs, and extend the set of graphs on which one can compute in polynomial time a solution that minimizes the total number of recomputations. In particular, we consider join graphs, which correspond to models such as siamese or cross-modal networks. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.

H-Revolve

Herrmann

ACM Trans. Math. Softw.

2020

We study the problem of checkpointing strategies for adjoint computation on synchronous hierarchical platforms, specifically computational platforms with several levels of storage with different writing and reading costs. When reversing a large adjoint chain, choosing which data to checkpoint and where is a critical decision for the overall performance of the computation. We introduce H-Revolve, an optimal algorithm for this problem. We make it available in a public Python library along with the implementation of several state-of-the-art algorithms for the variant of the problem with two levels of storage. We provide a detailed description of how one can use this library in an adjoint computation software in the field of automatic differentiation or backpropagation. Finally, we evaluate the performance of H-Revolve and other checkpointing heuristics through an extensive campaign of simulations.

Robustness of the Young/Daly formula for stochastic iterative applications

Marchal

et al. 2020

The Young/Daly formula for periodic checkpointing is known to hold for a divisible load application where one can checkpoint at any time-step. In a nutshell, the optimal period is $P_{YD} = \sqrt{2 \mu_f C}$, where $\mu_f$ is the Mean Time Between Failures (MTBF) and $C$ is the checkpoint time. This paper assesses the accuracy of the formula for applications decomposed into computational iterations where: (i) the duration of an iteration is stochastic, i.e., obeys a probability distribution law $D$ of mean $\mu_D$; and (ii) one can checkpoint only at the end of an iteration. We first consider static strategies where checkpoints are taken after a given number of iterations $k$ and provide a closed-form, asymptotically optimal, formula for $k$, valid for any distribution $D$. We then show that using the Young/Daly formula to compute $k$ (as $k \cdot \mu_D = P_{YD}$) is a first-order approximation of this formula. We also consider dynamic strategies where one decides to checkpoint at the end of an iteration only if the total amount of work since the last checkpoint exceeds a threshold $W_{th}$, and otherwise proceeds to the next iteration. Similarly, we provide a closed-form formula for this threshold and show that $P_{YD}$ is a first-order approximation of $W_{th}$. Finally, we provide an extensive set of simulations where $D$ is either Uniform, Gamma or truncated Normal, which shows the global accuracy of the Young/Daly formula, even when the distribution $D$ has a large standard deviation (and when one cannot use a first-order approximation). Hence we establish that the relevance of the formula goes well beyond its original framework.
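The first-order versions of the static and dynamic strategies described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the Young/Daly period itself both to derive k and as the threshold W_th, not the closed-form formulas of the paper (which the abstract does not give), it simulates no failures, and all function names and numeric values are hypothetical.

```python
import math


def young_daly_period(mtbf, checkpoint_cost):
    """Young/Daly period P_YD = sqrt(2 * mu_f * C)."""
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def static_iterations(mtbf, checkpoint_cost, mean_iteration):
    """First-order static strategy: pick k so that k * mu_D is close to P_YD."""
    k = round(young_daly_period(mtbf, checkpoint_cost) / mean_iteration)
    return max(1, k)  # always checkpoint after at least one iteration


def dynamic_checkpoints(durations, threshold):
    """Dynamic strategy on a failure-free trace of iteration durations.

    Returns the indices of the iterations after which a checkpoint is
    taken, i.e. when the work since the last checkpoint exceeds threshold.
    """
    checkpoints, work = [], 0.0
    for index, duration in enumerate(durations):
        work += duration
        if work > threshold:
            checkpoints.append(index)
            work = 0.0
    return checkpoints


# Hypothetical values (seconds): MTBF of 2 hours, 25 s checkpoint, 60 s mean iteration.
period = young_daly_period(7200, 25)
k = static_iterations(7200, 25, 60)
taken = dynamic_checkpoints([250, 200, 180, 400, 300, 100], threshold=period)
print(period, k, taken)
```

With these values the period is 600 s, so the static strategy checkpoints every 10 iterations, while the dynamic strategy checkpoints whenever the accumulated work passes 600 s, regardless of how many iterations that took.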

Scheduling on Two Unbounded Resources with Communication Costs

Ait

Kordon

2019

Heterogeneous computing systems, which contain several heterogeneous computing elements (e.g. CPU+GPU), have become popular and powerful platforms. In this paper, we consider that we have two platforms, each with an unbounded number of processors. We want to execute an application represented as a directed acyclic graph (DAG) using these two platforms. Each task of the application has two possible execution times, depending on the platform it is executed on. Also, there is a cost to transfer data from one platform to another between successive tasks. The goal here is to minimize the completion time of the last task of the application (usually called makespan). We show that the problem is NP-complete for graphs of depth at least 3 but polynomial for graphs of depth at most 2. Finally, we focus on particular classes of graphs, by providing polynomial-time algorithms for bipartite graphs, trees and 2-series-parallel graphs with different assumptions on communication delays.
Summary: Heterogeneous computing systems (e.g. CPU+GPU) are popular platforms. In this work, we consider a machine with two homogeneous computing platforms, each containing an unlimited number of computing resources. We seek to execute an application represented by a directed acyclic dependency graph on these platforms. Each task of the application has two possible execution modes, depending on the platform on which it is executed. In addition, we consider a communication cost between two successive tasks if they are not executed on the same platform. We aim to minimize the execution time of the application. We show that the problem is NP-complete for graphs of depth at least three, but polynomial for graphs of depth at most two. In addition, we show that optimal solutions can be computed in polynomial time for certain recursively defined classes of graphs (trees, series-parallel graphs).
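To make the objective concrete, the sketch below evaluates the makespan of one fixed task-to-platform mapping. Because each platform has an unbounded number of processors, a task can start as soon as all its inputs have arrived, so a single pass in topological order suffices. This only evaluates a given mapping; it is not one of the paper's algorithms for finding an optimal mapping, and the function name, data layout and example graph are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def makespan(preds, exec_time, comm_cost, platform):
    """Makespan of a DAG for a fixed task-to-platform mapping.

    preds:     dict task -> list of predecessor tasks
    exec_time: dict task -> (time on platform 0, time on platform 1)
    comm_cost: dict (pred, task) -> transfer cost, paid only when the
               two tasks run on different platforms
    platform:  dict task -> 0 or 1
    """
    finish = {}
    # static_order() yields every task after all of its predecessors.
    for task in TopologicalSorter(preds).static_order():
        ready = 0.0
        for pred in preds.get(task, ()):
            crosses = platform[pred] != platform[task]
            delay = comm_cost[(pred, task)] if crosses else 0.0
            ready = max(ready, finish[pred] + delay)
        finish[task] = ready + exec_time[task][platform[task]]
    return max(finish.values())


# Hypothetical diamond-shaped application: a -> {b, c} -> d.
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
exec_time = {"a": (2, 4), "b": (3, 1), "c": (5, 2), "d": (1, 3)}
comm_cost = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 1, ("c", "d"): 1}

all_on_0 = makespan(preds, exec_time, comm_cost, {t: 0 for t in preds})
mixed = makespan(preds, exec_time, comm_cost, {"a": 0, "b": 0, "c": 1, "d": 0})
print(all_on_0, mixed)
```

In this example, moving task c to the second platform pays two transfers but still shortens the makespan from 8 to 7, which is the kind of trade-off between execution time and communication cost that the problem captures.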

Optimal Checkpointing Strategies for Iterative Applications

Marchal

IEEE Trans. Parallel Distrib. Syst.

et al. 2022

This work provides an optimal checkpointing strategy to protect iterative applications from fail-stop errors. We consider a very general framework, where the application repeats the same execution pattern by executing consecutive iterations, and where each iteration is composed of several tasks. These tasks have different execution lengths and different checkpoint costs. Assume that there are $n$ tasks and that task $a_i$, where $0 \le i < n$, has execution time $t_i$ and checkpoint cost $C_i$. A naive strategy would checkpoint after each task. A strategy inspired by the Young/Daly formula would select the task $a_{\min}$ with smallest checkpoint cost $C_{\min}$ and would checkpoint after every $p$-th instance of that task, leading to a checkpointing period $P_{YD} = pT$, where $T = \sum_{i=0}^{n-1} t_i$ is the time per iteration. One would choose the period so that $P_{YD} = pT \approx \sqrt{2 \mu C_{\min}}$ to obey the Young/Daly formula, where $\mu$ is the application MTBF. Both the naive and Young/Daly strategies are suboptimal. Our main contribution is to show that the optimal checkpoint strategy is globally periodic, and to design a dynamic programming algorithm that computes the optimal checkpointing pattern. This pattern may well checkpoint many different tasks, and do so across many different iterations. We show through simulations, both from synthetic and real-life application scenarios, that the optimal strategy significantly outperforms the naive and Young/Daly strategies.
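The Young/Daly-inspired baseline described above (not the optimal dynamic programming algorithm, whose details the abstract does not give) can be sketched as follows. The function name `young_daly_pattern` and the numeric values are hypothetical, and rounding p to the nearest integer is an assumption made for this illustration.

```python
import math


def young_daly_pattern(exec_times, checkpoint_costs, mtbf):
    """Young/Daly-inspired baseline for an iterative application.

    exec_times:       list of task execution times t_i
    checkpoint_costs: list of task checkpoint costs C_i
    mtbf:             application MTBF (mu)
    Returns (index of the task to checkpoint, p, resulting period p * T).
    """
    iteration_time = sum(exec_times)  # T, the time per iteration
    # Task with the smallest checkpoint cost C_min.
    cheapest = min(range(len(checkpoint_costs)), key=checkpoint_costs.__getitem__)
    c_min = checkpoint_costs[cheapest]
    target_period = math.sqrt(2.0 * mtbf * c_min)
    # Checkpoint after every p-th instance of the cheapest task, p >= 1.
    p = max(1, round(target_period / iteration_time))
    return cheapest, p, p * iteration_time


# Hypothetical three-task iteration (seconds) and an MTBF of 20000 s.
task, p, period = young_daly_pattern([30, 50, 20], [10, 4, 8], 20000)
print(task, p, period)
```

Here the second task has the cheapest checkpoint, the target period is 400 s, and one iteration lasts 100 s, so the baseline checkpoints that task once every 4 iterations.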

IEEE Trans. Parallel Distrib. Syst.

Profiles of Upcoming HPC Applications and Their Impact on Reservation Strategies

Gainaru

Goglin

Honoré

et al. 2021

With the expected convergence between HPC, BigData and AI, new applications with different profiles are coming to HPC infrastructures. We aim at better understanding the features and needs of these applications in order to be able to run them efficiently on HPC platforms. The approach followed is bottom-up: we thoroughly study an emerging application, Spatially Localized Atlas Network Tiles (SLANT, originating from the neuroscience community), to understand its behavior. Based on these observations, we derive a generic, yet simple, application model (namely, a linear sequence of stochastic jobs). We expect this model to be representative of a large set of upcoming applications that require the computational power of HPC clusters without fitting the typical behavior of large-scale traditional applications. In a second step, we show how one can manipulate this generic model in a scheduling framework. Specifically, we consider the problem of making reservations (both time and memory) for an execution on an HPC platform. We derive solutions using the model of the first step of this work. We experimentally show the robustness of the model, even when very little data, or data from another application, is used to generate it, and show performance gains with regard to standard and more recent approaches used in the neuroscience community.