2020
DOI: 10.1007/s11227-020-03452-2
Automatic translation of data parallel programs for heterogeneous parallelism through OpenMP offloading


Cited by 4 publications (1 citation statement)
References 23 publications (34 reference statements)
“…Figure 4 (excerpt of DCU hardware information). NVIDIA's CUDA computation is based on a 32-thread-wide thread bundle, generally called a Warp, while OpenMP Offload computation for the GCN family of hardware on the DCU platform is based on a 64-thread-wide bundle, called a Wavefront [10]. Each CU has an SU (Scalar Unit), a processing unit shared by all threads in the wavefront for flow control, pointer computation, etc.…”
Section: DCU Architecture
confidence: 99%