2021
DOI: 10.1007/978-3-030-85262-7_11
Experience Report: Writing a Portable GPU Runtime with OpenMP 5.1

Cited by 19 publications (4 citation statements)
References 1 publication
“…But the main differences arise from the introduction of an abstraction such as the Compute Unit, and from the lack of support for a distributed shared-memory approach. More recent works have evaluated the OpenMP 5.2 specification [27,28] and have indicated the lack of appropriate work-distribution schemes for hybrid executions, as well as the absence of support for resolving the entanglement between work distribution and data placement in a distributed shared-memory architecture.…”
Section: Related Work
confidence: 99%
“…In a recent experience report, Tian et al. [28] presented the idea of a portable GPU runtime that supports both Nvidia and AMD GPUs. This replacement library can be shipped in Linux distributions' LLVM packages, which lowers the entry barrier for OpenMP offloading because no vendor-specific SDKs are required.…”
Section: Related Work
confidence: 99%
“…Specifically, within a single CPU+GPU node, general-purpose APIs typically used for GPUs include CUDA, OpenCL, OpenACC [1], and OpenMP [2]. For a GPU cluster, the aforementioned APIs are combined with MPI, for example CUDA+MPI [1] or OpenMP+CUDA+MPI [3]. However, this is a real challenge for many programmers, as they often use just a single API or only selected APIs.…”
Section: Introduction
confidence: 99%
“…Efficient and scalable programming of such GPU systems requires proper APIs. Specifically, within a single CPU+GPU node, general-purpose APIs typically used for GPUs include CUDA, OpenCL, OpenACC [1], and OpenMP [2]. For a GPU cluster, the aforementioned APIs are combined with MPI, for example CUDA+MPI [1] or OpenMP+CUDA+MPI [3].…”
Section: Introduction
confidence: 99%