2020
DOI: 10.1147/jrd.2019.2962428
OpenMP 4.5 compiler optimization for GPU offloading

Cited by 16 publications (4 citation statements)
References 1 publication
“…A wavefront executes on a single SIMD over four cycles [11]: each cycle issues one instruction to a batch of 16 lanes, so four batches cover all 64 lanes of the wavefront. A CU has four such vector units, so the total throughput is 64 single-precision operations per cycle [12]. The hardware architecture of the compute unit is shown in Figure 5.…”
Section: DCU Architecture
confidence: 99%
“…The DCU has 16 GB of global memory and 64 KB of shared memory within each CU. The experimental tests were carried out on a single node with a single DCU card [11].…”
Section: Test Environment
confidence: 99%
“…ExaHyPE, an exascale hyperbolic PDE engine [30], used a pragma-based GPU parallelization approach for object-oriented code and documented lessons learned. Other related work includes demonstrations of GPU support for OpenMP offloading features in the Flang/Clang compilers [3,25], a proof-of-concept implementation of offloading for FPGA-based accelerators [14,26], and an interprocedural static-analysis heuristic applied at runtime to select optimal grid sizes for offloaded target teams constructs [27], among others. There are publicly available benchmark suites to evaluate heterogeneous application performance, e.g.…”
Section: Related Work
confidence: 99%
“…The other compilers do not generate excess data movement for su3-v0 because of compiler optimization passes; e.g., the XL compiler uses interprocedural static analysis to determine that all threads in a team execute the same code [27]. The second mini-app, gppnaive, sums data contributions over 4 nested loops.…”
Section: Performance Issues Across Compilers
confidence: 99%