Abstract—Most recent HPC platforms have heterogeneous nodes composed of multi-core CPUs and accelerators such as GPUs. Programming such nodes typically relies on a combination of OpenMP and CUDA/OpenCL code, with scheduling based on a static partitioning and a cost model. We present the XKaapi runtime system for data-flow task programming on multi-CPU and multi-GPU architectures, which supports a data-flow task model and a locality-aware work-stealing scheduler. XKaapi enables task multi-implementation on CPU or GPU and multi-level parallelism with different grain sizes. We report performance results on two dense linear algebra kernels, matrix product (GEMM) and Cholesky factorization (POTRF), to evaluate XKaapi on a heterogeneous architecture composed of two hexa-core CPUs and eight NVIDIA Fermi GPUs. Our conclusion is twofold. First, fine-grained parallelism and online scheduling achieve performance results as good as static strategies, and in most cases outperform them. This is due to an improved work-stealing strategy that includes locality information, a very lightweight task implementation in XKaapi, and an optimized search for ready tasks. Second, the multi-level parallelism on multiple CPUs and GPUs enabled by XKaapi leads to a highly efficient Cholesky factorization. Using eight NVIDIA Fermi GPUs and four CPUs, we measure up to 2.43 TFlop/s on double-precision matrix product and 1.79 TFlop/s on Cholesky factorization, and respectively 5.09 TFlop/s and 3.92 TFlop/s in single precision.
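The central idea named above, task multi-implementation, is that one logical task carries both a CPU and a GPU variant, and the worker that executes (or steals) the task picks the variant matching its architecture. The following is a minimal illustrative sketch of that idea, assuming a hypothetical mini-runtime; it is not the XKaapi API:

```cpp
// Illustrative sketch only (hypothetical types, not XKaapi's API):
// one logical task holds a CPU and a GPU implementation, and the
// worker that dequeues or steals it selects the matching variant.
#include <functional>

enum class Arch { CPU, GPU };

struct MultiImplTask {
    std::function<void()> cpu_impl;  // e.g., a cblas_dgemm call on the host
    std::function<void()> gpu_impl;  // e.g., a cublasDgemm call on a stream

    void run(Arch worker_arch) const {
        // A GPU worker prefers the GPU variant when one is provided;
        // otherwise the task falls back to its CPU implementation.
        if (worker_arch == Arch::GPU && gpu_impl) gpu_impl();
        else cpu_impl();
    }
};
```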
The race for Exascale computing has naturally led current technologies to converge toward multi-CPU/multi-GPU computers, based on thousands of CPUs and GPUs interconnected by PCI-Express buses or interconnection networks. To exploit this high computing power, programmers have to solve the problem of scheduling parallel programs on hybrid architectures. And since the performance of a GPU increases at a much faster rate than the throughput of a PCI bus, data transfers must be managed efficiently by the scheduler. This paper targets multi-GPU compute nodes, where several GPUs are connected to the same machine. To overcome the data-transfer limitations on such platforms, existing software usually computes, before execution, a mapping of the tasks that respects their dependencies and minimizes the global data transfers. Such an approach is too rigid: it cannot adapt the execution to variations in the system or in the application's load. We propose a solution that is orthogonal to the ones mentioned above: extensions of the XKaapi software stack that exploit the full performance of a multi-GPU system through asynchronous GPU tasks. XKaapi schedules tasks with a standard work-stealing algorithm, and the runtime efficiently exploits concurrent GPU operations. The runtime extensions make it possible to overlap data transfers with task execution on the current generation of GPUs. We demonstrate that this overlapping capability is at least as important as the scheduling decision itself for reducing the completion time of a parallel program. Our experiments on two dense linear algebra problems (matrix product and Cholesky factorization) show that our solution is highly competitive with other software based on static scheduling. Moreover, we are able to sustain the peak performance (310 GFlop/s) on DGEMM, even for matrices that cannot be stored entirely in one GPU's memory. With eight GPUs, we achieve a speed-up of 6.74 with respect to a single GPU. The performance of our Cholesky factorization, which has more complex dependencies between tasks, outperforms the state-of-the-art single-GPU MAGMA code.
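The overlapping of transfers and computation described here rests on standard CUDA mechanisms: asynchronous copies on separate streams with pinned host memory. The sketch below illustrates that generic technique (tile sizes, the `process` kernel, and buffer names are placeholders; this is not XKaapi code):

```cuda
// Generic CUDA sketch of transfer/compute overlap with two streams and
// double buffering; while one stream computes a tile, the other can be
// transferring the next one over the PCI-Express bus.
#include <cuda_runtime.h>

__global__ void process(double* tile, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[i] *= 2.0;                 // placeholder computation
}

void pipeline(double* host_data, size_t ntiles, size_t tile_elems) {
    // host_data must be pinned (cudaMallocHost) for true async overlap.
    cudaStream_t s[2];
    double* dev[2];
    for (int k = 0; k < 2; ++k) {
        cudaStreamCreate(&s[k]);
        cudaMalloc(&dev[k], tile_elems * sizeof(double));
    }
    for (size_t t = 0; t < ntiles; ++t) {
        int k = t % 2;                         // alternate buffers/streams
        double* h = host_data + t * tile_elems;
        // Upload, compute, and download of tile t all queue on stream k.
        cudaMemcpyAsync(dev[k], h, tile_elems * sizeof(double),
                        cudaMemcpyHostToDevice, s[k]);
        process<<<(tile_elems + 255) / 256, 256, 0, s[k]>>>(dev[k], tile_elems);
        cudaMemcpyAsync(h, dev[k], tile_elems * sizeof(double),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    for (int k = 0; k < 2; ++k) {
        cudaStreamSynchronize(s[k]);
        cudaFree(dev[k]);
        cudaStreamDestroy(s[k]);
    }
}
```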
In this paper, we present a comparison of scheduling strategies for heterogeneous multi-CPU and multi-GPU architectures. We designed and evaluated four scheduling strategies on top of the XKaapi runtime: work stealing, data-aware work stealing, locality-aware work stealing, and Heterogeneous Earliest-Finish-Time (HEFT). On a heterogeneous architecture with 12 CPUs and 8 GPUs, we analysed these strategies with four benchmarks: a BLAS-1 AXPY vector operation, a Jacobi 2D iterative computation, and two linear algebra algorithms, Cholesky and LU factorization. We conclude that work stealing can be efficient if task annotations are provided along with a data-locality strategy. Furthermore, our experimental results suggest that HEFT scheduling performs better on applications with very regular computations and low data locality.
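To make the distinction between plain and locality-aware work stealing concrete, here is a hedged reconstruction of the steal heuristic being compared; the `Task`, `data_id`, and `resident_on` names are assumptions for illustration, not the XKaapi implementation:

```cpp
// Sketch of a locality-aware steal: among the victim's stealable tasks,
// a GPU thief prefers one whose input data is already valid in its own
// device memory, avoiding a PCIe transfer; otherwise it falls back to
// plain work stealing (take the first stealable task).
#include <cstddef>
#include <deque>

struct Task {
    std::size_t data_id;  // identifies the task's main input block
};

// Hypothetical helper: true if a valid copy of data_id resides on device dev.
bool resident_on(std::size_t data_id, int dev);

Task* locality_aware_steal(std::deque<Task*>& victim_queue, int thief_dev) {
    Task* fallback = nullptr;
    for (Task* t : victim_queue) {
        if (resident_on(t->data_id, thief_dev)) return t;  // data-local task
        if (!fallback) fallback = t;                       // plain-WS choice
    }
    return fallback;  // nullptr if the victim's queue is empty
}
```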
Process rescheduling is a useful mechanism for runtime load balancing, especially in dynamic and heterogeneous environments. In this context, we developed a model called MigBSP that controls process migration in BSP (Bulk Synchronous Parallel) applications. A BSP application is divided into one or more supersteps, each containing a computation phase and a communication phase followed by a barrier synchronization. Since the barrier waits for the slowest process, MigBSP's ultimate objective is to adjust the placement of processes in order to reduce superstep times. Its novel ideas are twofold. The first is the combination of three metrics (Memory, Computation and Communication) to measure the Potential of Migration of each BSP process. The second is a set of efficient adaptations that act on the rescheduling frequency. Together, these ideas make MigBSP a viable model for improving the performance of BSP applications, while imposing low overhead on application execution when migrations do not take place. This paper presents MigBSP's algorithms, the parallel machine organization, experimental results and related work.
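The abstract does not state how the three metrics are combined, so the sketch below assumes one natural reading: Computation and Communication affinities argue for migrating a process to a node, while the Memory metric (the cost of moving the process's state) argues against it. All names and the formula itself are assumptions for illustration:

```cpp
// Hedged sketch of a Potential-of-Migration combination, assuming
//   PM(i, j) = Comp(i, j) + Comm(i, j) - Mem(i, j)
// where higher PM means migrating process i to node j looks more
// profitable. This is an assumed form, not MigBSP's published formula.
struct Metrics {
    double comp;  // computation affinity of process i with node j
    double comm;  // communication affinity of process i with node j
    double mem;   // memory/migration cost of moving process i to node j
};

double potential_of_migration(const Metrics& m) {
    return m.comp + m.comm - m.mem;  // migrate when PM exceeds a threshold
}
```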
Enhancing the quality of weather and climate forecasts is a central scientific research objective worldwide. However, simulations of the atmosphere usually demand high processing power and large storage resources. In this context, we present the GBRAMS project, which applies grid computing to speed up the generation of a regional model climatology for Brazil. A grid infrastructure was built to perform long-term integrations of a mesoscale numerical model (BRAMS), managing a queue of up to nine independent jobs submitted to three clusters spread over Brazil. Three distinct middlewares, Globus Toolkit, OurGrid and OAR/CIGRI, were compared in their ability to manage these jobs, and results on the usage of each node of the grid are provided. We analyze the impact of the resulting climatology on the accuracy of climate forecasts, showing model bias removal that indicates the correctness of the generated climatology. Our central contributions are showing how to use grid computing to speed up climatology generation and assessing the impact of the middleware on this endeavor.
This work shows how a hybrid MPI/OpenMP implementation can improve the performance of the Ocean-Land-Atmosphere Model (OLAM) on a multi-core cluster environment; OLAM is a typical HPC application with a many-small-files workload. Previous experiments have shown that the scalability of this application on clusters is limited by the performance of its output operations. We show that the hybrid MPI/OpenMP version of OLAM decreases the number of output files, resulting in better performance for I/O operations. We also observe that the pure MPI version of OLAM performs better for unbalanced workloads, and that further parallel optimizations should be included in the hybrid version in order to improve OLAM's parallel execution time.
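The I/O benefit claimed above follows from the process/thread ratio: with one multithreaded MPI rank per node instead of one rank per core, only the ranks write, so the number of output files drops by the thread count. The following is a minimal hedged sketch of that funneled-output pattern (file names, sizes, and the computation are placeholders, not OLAM code):

```cpp
// Minimal MPI/OpenMP sketch: many threads per rank compute in parallel,
// but only the rank writes, yielding one file per rank, not one per core.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    int provided, rank;
    // FUNNELED: only the master thread of each rank makes MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<double> field(1 << 20, 0.0);   // placeholder model state

    #pragma omp parallel for                   // threads share the rank's data
    for (long i = 0; i < (long)field.size(); ++i)
        field[i] = 0.5 * i;                    // placeholder computation

    // Funneled output: a single file per MPI rank.
    char name[64];
    std::snprintf(name, sizeof(name), "olam_out_rank%04d.bin", rank);
    FILE* f = std::fopen(name, "wb");
    std::fwrite(field.data(), sizeof(double), field.size(), f);
    std::fclose(f);

    MPI_Finalize();
    return 0;
}
```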