Integrating GPU support for OpenMP offloading directives into Clang

Bertolli, Carlo; Antão, Samuel; Bercea, Gheorghe-Teodor; Jacob, Arpith C.; Eichenberger, Alexandre E.; Chen, Tong; Sura, Zehra; Sung, Hyojin; Rokos, Georgios; Appelhans, David; O’Brien, Kevin

doi:10.1145/2833157.2833161

Cited by 42 publications

(16 citation statements)

References 7 publications

“…The introduction of such features into this prominent open standard demonstrates that there is an ever-increasing acceptance that such architectures will become a permanent feature in modern supercomputing. Although the specification has been in existence since the middle of 2013, compiler support for the heterogeneous features has been limited to a number of experimental open source implementations until more recently [3], [4], [5]. Until now, the principal use of OpenMP 4.0 has been for targeting the Intel Xeon Phi Knights Corner (KNC) architecture, but future releases of the Intel Xeon Phi architecture, such as the Knights Landing, are going to self-host, removing the requirement for an offloading model.…”

Section: Introduction (mentioning)

confidence: 99%

Evaluating OpenMP 4.0's Effectiveness as a Heterogeneous Parallel Programming Model

Martineau, McIntosh–Smith, Gaudin

2016

2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Abstract. Although the OpenMP 4.0 standard has been available since 2013, support for GPUs has been absent up until very recently, with only a handful of experimental compilers available. In this work we evaluate the performance of Cray's new NVIDIA GPU targeting implementation of OpenMP 4.0, with the mini-apps TeaLeaf, CloverLeaf and BUDE. We successfully port each of the applications, using a simple and consistent design throughout, and achieve performance on an NVIDIA K20X that is comparable to Cray's OpenACC in all cases. BUDE, a compute bound code, required 2.2x the runtime of an equivalently optimised CUDA code, which we believe is caused by an inflated frequency of control flow operations and less efficient arithmetic optimisation. Impressively, both TeaLeaf and CloverLeaf, memory bandwidth bound codes, only required 1.3x the runtime of hand-optimised CUDA implementations. Overall, we find that OpenMP 4.0 is a highly usable open standard capable of performant heterogeneous execution, making it a promising option for scientific application developers.

“…Ozen et al [4] partially implemented OpenMP 4.0 in the OmpSs compiler and performed a performance evaluation with three kernels. Bertolli et al [3] and Bercea et al [18] implemented GPU support for Clang using the OpenMP 4.0 specification, and presented performance results for a representative set of kernels in LULESH.…”

mentioning

confidence: 99%

Evaluating OpenMP 4.0's Effectiveness as a Heterogeneous Parallel Programming Model

Martineau, McIntosh–Smith, Gaudin

2016

2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

“…Bertolli et al [3] discuss the coordination of threads within an NVIDIA GPU, and show that their novel approach limits the impact on code generation when integrated into the LLVM compiler infrastructure. They later discussed their approach to integrating OpenMP 4.5 offloading for NVIDIA GPUs into Clang [2].…”

Section: Concluding Suggestions For Performance Portability (mentioning)

confidence: 99%

“…Some experimental compilers were developed in the interim, with the most notable being the Clang OpenMP 4.5 project, which was contributed to by a number of collaborators, including AMD, IBM, Intel, and NVIDIA. In particular, the GPU targeting functionality was developed by IBM, who are actively migrating this functionality into the main trunk of Clang [2]. In September 2015, the Cray Compiling Environment version 8.4 introduced the first official vendor support for OpenMP 4.0 on NVIDIA GPUs, with full support for version 4.0 of the specification.…”

Section: Introduction (mentioning)

confidence: 99%

Pragmatic Performance Portability with OpenMP 4.x

Martineau, Price, McIntosh–Smith, et al. 2016

OpenMP: Memory, Devices, and Tasks

Abstract. In this paper we investigate the current compiler technologies supporting OpenMP 4.x features targeting a range of devices, in particular, the Cray compiler 8.5.0 targeting an Intel Xeon Broadwell and NVIDIA K20x, IBM's OpenMP 4.5 Clang branch (clang-ykt) targeting an NVIDIA K20x, the Intel compiler 16 targeting an Intel Xeon Phi Knights Landing, and GCC 6.1 targeting an AMD APU. We outline the mechanisms that they use to map the OpenMP model onto their target architectures, and conduct performance testing with a number of representative data parallel kernels. Following this we present a discussion about the current state of play in terms of performance portability and propose some straightforward guidelines for writing performance portable code, derived from our observations. At the time of writing, developers will likely have to rely on the pre-processor for certain kernels to achieve functional portability, but we expect that future homogenisation of required directives between compilers and architectures is feasible.

“…The existing support for OpenMP target regions is built on top of the host implementation of OpenMP and is confined, almost exclusively, to the Clang frontend code generation module. The most recent code generation scheme for OpenMP target regions is detailed in [5] and is based on previous work [1,3,4] covering data-parallel cases [2,6] as well as nested parallelism [5].…”

Section: Introduction (mentioning)

confidence: 99%

Implementing implicit OpenMP data sharing on GPUs

Bercea, Bertolli, Jacob, et al. 2017

Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC

Self Cite

OpenMP is a shared memory programming model which supports the offloading of target regions to accelerators such as NVIDIA GPUs. The implementation in Clang/LLVM aims to deliver a generic GPU compilation toolchain that supports both the native CUDA C/C++ and the OpenMP device offloading models. There are situations where the semantics of OpenMP and those of CUDA diverge. One such example is the policy for implicitly handling local variables. In CUDA, local variables are implicitly mapped to thread local memory and thus become private to a CUDA thread. In OpenMP, due to semantics that allow the nesting of regions executed by different numbers of threads, variables need to be implicitly shared among the threads of a contention group. In this paper we introduce a re-design of the OpenMP device data sharing infrastructure that is responsible for the implicit sharing of local variables in the Clang/LLVM toolchain. We introduce a new data sharing infrastructure that lowers implicitly shared variables to the shared memory of the GPU. We measure the amount of shared memory used by our scheme in cases that involve scalar variables and statically allocated arrays. The evaluation is carried out by offloading to K40 and P100 NVIDIA GPUs. For scalar variables the pressure on shared memory is relatively low, under 26% of shared memory utilization for the K40, and does not negatively impact occupancy. The limiting occupancy factor in that case is register pressure. The data sharing scheme offers the users a simple memory model for controlling the implicit allocation of device shared memory.
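
The implicit-sharing situation described in this abstract can be illustrated with a small C sketch. It is not taken from the paper: the function name `scaled_sum`, the array size `N`, and the variable `scale` are hypothetical and chosen only for illustration. The variable `scale` is declared inside the target region by the thread that starts it, yet the nested `parallel for` reads it from every thread, so an offloading compiler has to place it somewhere all those threads can reach.

```c
#define N 8

/* Hypothetical helper (not from the paper): fills out[i] = scale * i
 * inside a target region and returns the sum of the results. */
int scaled_sum(void) {
    int out[N];
    int total = 0;

    /* Offload the block; copy 'out' back to the host afterwards. */
    #pragma omp target map(from: out[0:N])
    {
        /* Declared by the single thread that begins the target region. */
        int scale = 3;

        /* 'scale' is declared outside this parallel construct, so OpenMP
         * treats it as shared: every worker thread reads the same copy. */
        #pragma omp parallel for
        for (int i = 0; i < N; ++i) {
            out[i] = scale * i;
        }
    }

    for (int i = 0; i < N; ++i) {
        total += out[i];
    }
    return total;
}
```

Compiled without OpenMP support, the pragmas are ignored and the function runs sequentially on the host with the same result, so the sketch shows only the semantics and says nothing about how Clang lowers `scale` on a GPU.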
