DOI: 10.1109/tpds.2017.2766064
Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model

Abstract: The emergence of accelerators as standard computing resources on supercomputers, and the subsequent increase in architectural complexity, revived the need for high-level parallel programming paradigms. The sequential task-based programming model has been shown to meet this challenge efficiently on a single multicore node, possibly enhanced with accelerators, which motivated its support in the OpenMP 4.0 standard. In this paper, we show that this paradigm can also be employed to achieve high performance on modern supercomputers …

Cited by 59 publications (88 citation statements)
References 35 publications
“…Fig. 9 presents the execution traces of dense and TLR Cholesky factorizations, as implemented in task-based Chameleon [1] and HiCMA, respectively. These traces highlight the CPU idle time (shown in red) in HiCMA, since StarPU is not able to compensate for the data movement overhead with the tasks' computations.…”
Section: Performance Results
mentioning confidence: 99%
“…HiCMA leverages the tile data descriptor in order to support the new tile low-rank (TLR) compression format. While this data descriptor is paramount to expose parallelism, it is also critical for data management in distributed-memory environments [1,12]. HiCMA adopts a flattened algorithmic design to bring task parallelism to the fore, as opposed to the plain recursive approach that has constituted the basis for the performance of previous H-matrix libraries [19,17,16].…”
Section: The HiCMA Software Library
mentioning confidence: 99%
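The flattened design this excerpt describes can be pictured as a plain loop nest that submits one task per tile kernel and lets the runtime infer dependencies from the declared data accesses. Below is a minimal sketch in the sequential task-based (STF) style against StarPU's C API; the codelets cl_potrf, cl_trsm, cl_syrk, cl_gemm and the tile-handle accessor A(i,j) are assumed to be provided elsewhere, and the actual HiCMA/Chameleon kernel names differ.

```c
/* Sketch of a flattened, task-based tile Cholesky (lower triangle).
 * Plain loops submit tasks; StarPU infers dependencies from the
 * STARPU_R / STARPU_RW access modes declared per tile handle.
 * cl_potrf/cl_trsm/cl_syrk/cl_gemm and A(i,j) are hypothetical. */
#include <starpu.h>

extern struct starpu_codelet cl_potrf, cl_trsm, cl_syrk, cl_gemm;
extern starpu_data_handle_t A(int i, int j); /* handle of tile (i,j) */

void tile_cholesky(int nt) /* nt = number of tile rows/columns */
{
    for (int k = 0; k < nt; k++) {
        starpu_task_insert(&cl_potrf, STARPU_RW, A(k, k), 0);
        for (int i = k + 1; i < nt; i++)
            starpu_task_insert(&cl_trsm,
                               STARPU_R,  A(k, k),
                               STARPU_RW, A(i, k), 0);
        for (int i = k + 1; i < nt; i++) {
            starpu_task_insert(&cl_syrk,
                               STARPU_R,  A(i, k),
                               STARPU_RW, A(i, i), 0);
            for (int j = k + 1; j < i; j++)
                starpu_task_insert(&cl_gemm,
                                   STARPU_R,  A(i, k),
                                   STARPU_R,  A(j, k),
                                   STARPU_RW, A(i, j), 0);
        }
    }
    starpu_task_wait_for_all(); /* drain the submitted task graph */
}
```

Because the loops are flat rather than recursive, every tile kernel is submitted as an independent task, which is what lets the runtime expose parallelism across the whole trailing submatrix.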
“…This is achieved by standardizing existing dynamic runtime system APIs (e.g., OpenMP [32], OmpSs [33], [34], [35], QUARK [36], StarPU [37], [43], PaRSEC [38], SuperMatrix [39]) through a thin layer of abstraction, making the developer experience oblivious to the underlying runtime system and its corresponding hardware deployment. For instance, this hardware/runtime-oblivious software infrastructure has already been used with StarPU [40], and more recently with OmpSs [41], in the context of computational astronomy applications.…”
Section: The Chameleon Library
mentioning confidence: 99%
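One way to picture such a thin abstraction layer is a single runtime-neutral task-insertion entry point that a build-time switch maps onto whichever runtime is selected. The sketch below is purely illustrative and is not Chameleon's actual API; rt_insert_task and the RT_BACKEND_* macros are hypothetical names.

```c
/* Hypothetical thin abstraction layer: algorithms call one
 * runtime-neutral entry point; a build-time switch forwards it to the
 * selected runtime system. Not the actual Chameleon API. */
#if defined(RT_BACKEND_STARPU)
  #include <starpu.h>
  /* Forward directly to StarPU's variadic task insertion. */
  #define rt_insert_task(cl, ...) starpu_task_insert((cl), __VA_ARGS__)
#elif defined(RT_BACKEND_QUARK)
  /* ... map rt_insert_task onto QUARK_Insert_Task here ... */
#endif
```

With this indirection, a tile algorithm written once against rt_insert_task could be deployed on any supported runtime without touching the algorithmic code.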
“…This improves user productivity, all the more so for runtimes such as StarPU, which can transparently handle a single heterogeneous node, and even multiple heterogeneous nodes when the StarPU-MPI [43] extension is used. To enable such portability, StarPU tasks are associated with codelets, which group multiple implementations of the same task (CPU, CUDA, OpenCL, OpenMP, etc.) under the same name.…”
Section: The StarPU Dynamic Runtime System
mentioning confidence: 99%
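As a concrete illustration of a codelet grouping several implementations, here is a minimal sketch against StarPU's C API. The kernel functions scale_cpu and scale_cuda are assumed user-provided (the CUDA variant would live in a .cu file); only the codelet structure itself follows StarPU's documented layout.

```c
/* A StarPU codelet grouping CPU and CUDA implementations of the same
 * "scale a vector" task; the scheduler picks one at execution time. */
#include <starpu.h>

void scale_cpu(void *buffers[], void *cl_arg)
{
    float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    for (unsigned i = 0; i < n; i++)
        v[i] *= 2.0f;
}

extern void scale_cuda(void *buffers[], void *cl_arg); /* in a .cu file */

struct starpu_codelet scale_cl = {
    .cpu_funcs  = { scale_cpu },  /* CPU implementation */
    .cuda_funcs = { scale_cuda }, /* CUDA implementation */
    .nbuffers   = 1,
    .modes      = { STARPU_RW },  /* read-write access to the vector */
};
```

A submitted task names only scale_cl; StarPU's scheduler then chooses among the available implementations according to the processing units it manages.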
“…One of StarPU's strengths is that the system relies on the variety of scheduling strategies available to adapt to applications and platforms, with both centralized and distributed solutions. Recently, support for automatically inferring data communication was added to StarPU to help users move toward distributed architectures.…”
Section: Introduction
mentioning confidence: 99%
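The automatically inferred communication mentioned here corresponds to StarPU-MPI's task insertion: each data handle is registered with an owner rank, and the matching sends and receives are deduced from the task graph. A minimal sketch, assuming StarPU and StarPU-MPI have already been initialized and reusing the codelet from above; the tag value and function name are illustrative.

```c
/* Sketch of StarPU-MPI usage: data ownership is declared once, and the
 * runtime infers the MPI transfers required by subsequent tasks.
 * Assumes starpu_init() and starpu_mpi_init() were called earlier. */
#include <starpu.h>
#include <starpu_mpi.h>

extern struct starpu_codelet scale_cl; /* e.g., the codelet above */

void run(starpu_data_handle_t h, int owner_rank)
{
    /* Declare which rank owns the data, with a unique MPI tag
     * (42 is arbitrary here). */
    starpu_mpi_data_register(h, 42, owner_rank);

    /* Every rank submits the same task; StarPU-MPI decides where it
     * runs and generates the needed communication automatically. */
    starpu_mpi_task_insert(MPI_COMM_WORLD, &scale_cl,
                           STARPU_RW, h, 0);

    starpu_task_wait_for_all();
}
```

Since all ranks unroll the same sequential task flow, no explicit send/receive calls appear in application code; the data registration alone determines the communication pattern.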