Automated GPU Grid Geometry Selection for OpenMP Kernels

Lloyd, Taylor; Chikin, Artem; Kedia, Sanket; Jain, Dhruv; Amaral, José Nelson

doi:10.1109/cahpc.2018.8645848

Cited by 4 publications

(2 citation statements)

References 11 publications

“…For some programs that are not sensitive to register resources [15], although optimization can also reduce the number of registers, its performance changes are not obvious. At the same time, both optimizations are for the case where the number of loop iterations is known.…”

Section: Future Work (mentioning)

confidence: 99%

DCU oriented OpenMP offload register optimization method

Chai, Gao, Lin, et al. 2023

2022 2nd Conference on High Performance Computing and Communication Engineering (HPCCE 2022)


OpenMP is a mainstream heterogeneous programming model, so research on its offload performance has important practical significance. The Songshan supercomputer system installed at the Zhengzhou Supercomputing Center is a new-generation E-class high-performance computer cluster independently developed by China, and the DCU chip installed in it is also domestically produced. To improve the offload performance of OpenMP on this platform and make full use of hardware resources such as registers, a redundant-loop optimization for thread iteration on the DCU is proposed, so that a thread can release its register resources promptly after completing its computation, relieving back-end register allocation pressure and improving program performance. In addition, building on the loop unrolling optimization algorithm in LLVM and taking into account the hardware and instruction set characteristics of the domestic platform, a better algorithm for calculating the loop unrolling factor is proposed to make loop unrolling more effective. On SPEC ACCEL and Polybench, the thread iteration optimization reduced the overall register count by 33.7% on average, and the loop unrolling optimization improved performance by 37% on average.

“…For example, when applying the analysis to calculate the inter-thread access stride of a GPU parallel loop, a number of iterations equal to the GPU thread-block size is tested. Existing OpenMP GPU runtimes select fixed-sized thread-block sizes based on the target GPU architecture (e.g., 128 for Pascal [17]).…”

Section: Loop Iteration Point Algebraic Differences (mentioning)

confidence: 99%

Memory-access-aware Safety and Profitability Analysis for Transformation of Accelerator-bound OpenMP Loops

Chikin, Lloyd, Amaral, et al. 2019

ACM Trans. Archit. Code Optim.

Self Cite


Iteration Point Difference Analysis is a new static analysis framework that can be used to determine the memory coalescing characteristics of parallel loops that target GPU offloading and to ascertain safety and profitability of loop transformations with the goal of improving their memory access characteristics. This analysis can propagate definitions through control flow, works for non-affine expressions, and is capable of analyzing expressions that reference conditionally defined values. This analysis framework enables safe and profitable loop transformations. Experimental results demonstrate potential for dramatic performance improvements. GPU kernel execution time across the Polybench suite is improved by up to 25.5× on an Nvidia P100 with benchmark overall improvement of up to 3.2×. An opportunity detected in a SPEC ACCEL benchmark yields kernel speedup of 86.5× with a benchmark improvement of 3.3×. This work also demonstrates how architecture-aware compilers improve code portability and reduce programmer effort.
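To make the idea of an inter-thread access stride concrete, here is a minimal sketch. It is not the static, symbolic analysis described in the abstract: it simply evaluates an index expression for consecutive iterations and checks whether the difference is constant. The function names `inter_thread_stride` and `is_coalesced` are hypothetical, and the default of 128 iterations is taken from the fixed thread-block size mentioned in the citation statement above.

```python
from typing import Callable, Optional


def inter_thread_stride(index_of: Callable[[int], int],
                        block_size: int = 128) -> Optional[int]:
    """Brute-force stand-in for a stride analysis (illustrative only).

    index_of maps a parallel-loop iteration number to the array element
    that iteration accesses. Adjacent iterations are assumed to map to
    adjacent GPU threads. Returns the constant difference between the
    elements accessed by adjacent iterations 0..block_size-1, or None
    if that difference is not constant.
    """
    diffs = {index_of(i + 1) - index_of(i) for i in range(block_size - 1)}
    return diffs.pop() if len(diffs) == 1 else None


def is_coalesced(index_of: Callable[[int], int],
                 block_size: int = 128) -> bool:
    """Treat a stride of exactly one element as a coalesced access."""
    return inter_thread_stride(index_of, block_size) == 1


if __name__ == "__main__":
    n = 1024  # hypothetical row length of a row-major 2D array
    j = 3     # fixed index of the other dimension

    # a[i * n + j]: adjacent threads are n elements apart (uncoalesced).
    print(inter_thread_stride(lambda i: i * n + j))
    # a[j * n + i]: adjacent threads touch adjacent elements (coalesced).
    print(is_coalesced(lambda i: j * n + i))
```

Under these assumptions, a loop whose accesses have a large constant stride is a candidate for a transformation such as loop interchange, while a non-constant difference (returned as `None` here) would need a more careful analysis.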
