DOI: 10.1109/tpds.2017.2766064
Achieving High Performance on Supercomputers with a Sequential Task-based Programming Model

Abstract: The emergence of accelerators as standard computing resources on supercomputers, and the subsequent increase in architectural complexity, revived the need for high-level parallel programming paradigms. The sequential task-based programming model has been shown to meet this challenge efficiently on a single multicore node, possibly enhanced with accelerators, which motivated its support in the OpenMP 4.0 standard. In this paper, we show that this paradigm can also be employed to achieve high performance on modern supercomputers …

Cited by 59 publications (88 citation statements)
References 35 publications
“…Fig. 9 presents the execution traces of dense and TLR Cholesky factorizations, as implemented in task-based Chameleon [1] and HiCMA, respectively. These traces highlight the CPU idle time (shown in red) in HiCMA, since StarPU is not able to compensate for the data movement overhead with the tasks' computations.…”
Section: Performance Results
mentioning confidence: 99%
“…HiCMA leverages the tile data descriptor in order to support the new tile low-rank (TLR) compression format. While this data descriptor is paramount to expose parallelism, it is also critical for data management in distributed-memory environments [1,12]. HiCMA adopts a flattened algorithmic design to bring task parallelism to the fore, as opposed to the plain recursive approach that has constituted the basis for the performance of previous H-matrix libraries [19,17,16].…”
Section: The HiCMA Software Library
mentioning confidence: 99%
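The flattened design this excerpt describes can be pictured as a plain loop nest that submits one task per tile kernel and lets the runtime infer dependencies from the declared data accesses. Below is a minimal sketch in the sequential task-based (STF) style against StarPU's C API; the codelets cl_potrf, cl_trsm, cl_syrk, cl_gemm and the tile-handle accessor A(i,j) are assumed to be provided elsewhere, and the actual HiCMA/Chameleon kernel names differ.

```c
/* Sketch of a flattened, task-based tile Cholesky (lower triangle).
 * Plain loops submit tasks; StarPU infers dependencies from the
 * STARPU_R / STARPU_RW access modes declared per tile handle.
 * cl_potrf/cl_trsm/cl_syrk/cl_gemm and A(i,j) are hypothetical. */
#include <starpu.h>

extern struct starpu_codelet cl_potrf, cl_trsm, cl_syrk, cl_gemm;
extern starpu_data_handle_t A(int i, int j); /* handle of tile (i,j) */

void tile_cholesky(int nt) /* nt = number of tile rows/columns */
{
    for (int k = 0; k < nt; k++) {
        starpu_task_insert(&cl_potrf, STARPU_RW, A(k, k), 0);
        for (int i = k + 1; i < nt; i++)
            starpu_task_insert(&cl_trsm,
                               STARPU_R,  A(k, k),
                               STARPU_RW, A(i, k), 0);
        for (int i = k + 1; i < nt; i++) {
            starpu_task_insert(&cl_syrk,
                               STARPU_R,  A(i, k),
                               STARPU_RW, A(i, i), 0);
            for (int j = k + 1; j < i; j++)
                starpu_task_insert(&cl_gemm,
                                   STARPU_R,  A(i, k),
                                   STARPU_R,  A(j, k),
                                   STARPU_RW, A(i, j), 0);
        }
    }
    starpu_task_wait_for_all(); /* drain the submitted task graph */
}
```

Because the loops are flat rather than recursive, every tile kernel is submitted as an independent task, which is what lets the runtime expose parallelism across the whole trailing submatrix.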
“…This is achieved by standardizing existing dynamic runtime system APIs (e.g., OpenMP [32], OmpSs [33], [34], [35], QUARK [36], StarPU [37], [43], PaRSEC [38], SuperMatrix [39]) through a thin layer of abstraction, making the developer experience oblivious to the underlying runtime system and its corresponding hardware deployment. For instance, this hardware/runtime-oblivious software infrastructure has already been used with StarPU [40], and more recently with OmpSs [41], in the context of computational astronomy applications.…”
Section: The Chameleon Library
mentioning confidence: 99%
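One way to picture such a thin abstraction layer is a single runtime-neutral task-insertion entry point that a build-time switch maps onto whichever runtime is selected. The sketch below is purely illustrative and is not Chameleon's actual API; rt_insert_task and the RT_BACKEND_* macros are hypothetical names.

```c
/* Hypothetical thin abstraction layer: algorithms call one
 * runtime-neutral entry point; a build-time switch forwards it to the
 * selected runtime system. Not the actual Chameleon API. */
#if defined(RT_BACKEND_STARPU)
  #include <starpu.h>
  /* Forward directly to StarPU's variadic task insertion. */
  #define rt_insert_task(cl, ...) starpu_task_insert((cl), __VA_ARGS__)
#elif defined(RT_BACKEND_QUARK)
  /* ... map rt_insert_task onto QUARK_Insert_Task here ... */
#endif
```

With this indirection, a tile algorithm written once against rt_insert_task could be deployed on any supported runtime without touching the algorithmic code.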
“…This improves user productivity, all the more so for runtimes such as StarPU, which can transparently handle a single heterogeneous node, and even multiple heterogeneous nodes when the StarPU-MPI [43] extension is used. To enable such portability, StarPU tasks are associated with codelets, which group multiple implementations of the same task (CPU, CUDA, OpenCL, OpenMP, etc.) under the same name.…”
Section: The StarPU Dynamic Runtime System
mentioning confidence: 99%
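As a concrete illustration of a codelet grouping several implementations, here is a minimal sketch against StarPU's C API. The kernel functions scale_cpu and scale_cuda are assumed user-provided (the CUDA variant would live in a .cu file); only the codelet structure itself follows StarPU's documented layout.

```c
/* A StarPU codelet grouping CPU and CUDA implementations of the same
 * "scale a vector" task; the scheduler picks one at execution time. */
#include <starpu.h>

void scale_cpu(void *buffers[], void *cl_arg)
{
    float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    for (unsigned i = 0; i < n; i++)
        v[i] *= 2.0f;
}

extern void scale_cuda(void *buffers[], void *cl_arg); /* in a .cu file */

struct starpu_codelet scale_cl = {
    .cpu_funcs  = { scale_cpu },  /* CPU implementation */
    .cuda_funcs = { scale_cuda }, /* CUDA implementation */
    .nbuffers   = 1,
    .modes      = { STARPU_RW },  /* read-write access to the vector */
};
```

A submitted task names only scale_cl; StarPU's scheduler then chooses among the available implementations according to the processing units it manages.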
“…One of StarPU's strengths is that the system relies on the variety of scheduling strategies available to adapt to applications and platforms, with both centralized and distributed solutions. Recently, support for automatically inferring data communication was added to StarPU to help users move toward distributed architectures.…”
Section: Introduction
mentioning confidence: 99%
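The automatically inferred communication mentioned here corresponds to StarPU-MPI's task insertion: each data handle is registered with an owner rank, and the matching sends and receives are deduced from the task graph. A minimal sketch, assuming StarPU and StarPU-MPI have already been initialized and reusing the codelet from above; the tag value and function name are illustrative.

```c
/* Sketch of StarPU-MPI usage: data ownership is declared once, and the
 * runtime infers the MPI transfers required by subsequent tasks.
 * Assumes starpu_init() and starpu_mpi_init() were called earlier. */
#include <starpu.h>
#include <starpu_mpi.h>

extern struct starpu_codelet scale_cl; /* e.g., the codelet above */

void run(starpu_data_handle_t h, int owner_rank)
{
    /* Declare which rank owns the data, with a unique MPI tag
     * (42 is arbitrary here). */
    starpu_mpi_data_register(h, 42, owner_rank);

    /* Every rank submits the same task; StarPU-MPI decides where it
     * runs and generates the needed communication automatically. */
    starpu_mpi_task_insert(MPI_COMM_WORLD, &scale_cl,
                           STARPU_RW, h, 0);

    starpu_task_wait_for_all();
}
```

Since all ranks unroll the same sequential task flow, no explicit send/receive calls appear in application code; the data registration alone determines the communication pattern.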