Scheduling Support for Communicating Parallel Tasks

Dümmler, Jörg; Rauber, Thomas; Rünger, Gudula

doi:10.1007/978-3-642-36036-7_17

Cited by 2 publications

(4 citation statements)

References 19 publications

“…Applications have been coded and compiled within the ROCm-3.5.0 framework and llvm 12 compiler suite. The CPU code has been compiled combining a C++ NPB-MZ implementation (Dümmler and Rünger 2013) and the original NPB-MZ Fortran implementation (der Wijngaart and Jin 2003) to generate a version compatible with the ROCm implementation of the applications. All experiments have been performed in a system composed of AMD EPYC 7742 @ 2.250 GHz (64 cores and 2 threads/core, totalling 128 threads per node) and 2 × GPU AMD Radeon Instinct MI50 with 32 GB.…”

Section: Discussion (mentioning)

confidence: 99%

Heterogeneous programming using OpenMP and CUDA/HIP for hybrid CPU-GPU scientific applications

Tallada

Morancho

2023

The International Journal of High Performance Computing Applica

Hybrid computer systems combine compute units (CUs) of different nature like CPUs, GPUs and FPGAs. Simultaneously exploiting the computing power of these CUs requires a careful decomposition of the applications into balanced parallel tasks according to both the performance of each CU type and the communication costs among them. This paper describes the design and implementation of runtime support for OpenMP hybrid GPU-CPU applications, when mixed with GPU-oriented programming models (e.g. CUDA/HIP). The paper describes the case for a hybrid multi-level parallelization of the NPB-MZ benchmark suite. The implementation exploits both coarse-grain and fine-grain parallelism, mapped to compute units of different nature (GPUs and CPUs). The paper describes the implementation of runtime support to bridge OpenMP and HIP, introducing the abstractions of Computing Unit and Data Placement. We compare hybrid and non-hybrid executions under state-of-the-art schedulers for OpenMP: static and dynamic task schedulings. Then, we improve the set of schedulers with two additional variants: a memorizing-dynamic task scheduling and a profile-based static task scheduling. On a computing node composed of one AMD EPYC 7742 @ 2.250 GHz (64 cores and 2 threads/core, totalling 128 threads per node) and 2 × GPU AMD Radeon Instinct MI50 with 32 GB, hybrid executions present speedups from 1.10× up to 3.5× with respect to a non-hybrid GPU implementation, depending on the number of activated CUs.

Section: Discussion (mentioning)

confidence: 99%

“…Dümmler and Rünger (2013) evaluated NPB-MZ benchmarks on hybrid CPU + GPU architectures. They decompose the workloads and, using a static scheduling, distribute them among the CPU’s or the GPU.…”

Section: Related Work (mentioning)

confidence: 99%

Heterogeneous programming using OpenMP and CUDA/HIP for hybrid CPU-GPU scientific applications

Tallada

Morancho

2023

The International Journal of High Performance Computing Applica

“…All applications have been coded combining OpenMP 5.2 and the ROCM‐3.5.0 framework and compiled with llvm 12 . The CPU code has been implemented combining a C++ NPB‐MZ implementation 8 and the original NPB‐MZ Fortran implementation 5 to generate a version compatible with the ROCM implementation of the applications. The input mesh sizes correspond to class D using a total 13GB of memory and 1024 zones for SP‐MZ and BT‐MZ, and 16 zones for LU‐MZ.…”

Section: Discussion (mentioning)

confidence: 99%

“…NPB‐MZ studies : Dümmler and Rünger 8 evaluated NPB‐MZ benchmarks on hybrid CPU+GPU architectures. Workloads are decomposed and, using a static scheduling, distributed among CPUs and GPUs.…”

Section: Related Work (mentioning)

confidence: 99%

Compute units in OpenMP: Extensions for heterogeneous parallel programming

Gonzàlez‐Tallada

Morancho

2023

Concurrency and Computation

Summary: This article evaluates the current support for heterogeneous OpenMP 5.2 applications regarding the simultaneous activation of host and device computing units (e.g., CPUs, GPUs, or FPGAs). The article identifies limitations in the current OpenMP specification and describes the design and implementation of novel OpenMP extensions and runtime support for heterogeneous parallel programming. The Compute Unit (CU) abstraction is introduced in the OpenMP programming model. The Compute Unit abstraction is defined in terms of an aggregation of computing elements (e.g., CPUs, GPUs, FPGAs). On top of CUs, the article describes dynamic work sharing constructs and schedulers that address the inherent differences in compute power of host and device CUs. New constructs and the corresponding runtime support are described for the new abstractions. The article evaluates the case of a hybrid multilevel parallelization of the NPB‐MZ benchmark suite. The implementation exploits both coarse‐grain and fine‐grain parallelism, mapped to CUs of different nature (GPUs and CPUs). All CUs are activated using the new extensions and runtime support. We compare hybrid and nonhybrid executions under two state‐of‐the‐art work‐distribution schemes (Static and Dynamic Task schedulers). On a computing node composed of one AMD EPYC 7742 @ 2.250 GHz (64 cores and 2 threads/core, totalling 128 threads per node) and 2 GPU AMD Radeon Instinct MI50 with 32 GB, hybrid executions present speedups from 1.08 up to 3.18 with respect to a nonhybrid GPU implementation, depending on the number of activated CUs.
