2016
DOI: 10.1002/cpe.4029
Machine learning‐based auto‐tuning for enhanced performance portability of OpenCL applications

Abstract: Heterogeneous computing, combining devices with different architectures such as CPUs and GPUs, is rising in popularity and promises increased performance combined with reduced energy consumption. OpenCL has been proposed as a standard for programming such systems and offers functional portability. However, it suffers from poor performance portability, because applications must be retuned for every new device. In this paper, we use machine learning‐based auto‐tuning to address this problem. Benchmarks a…
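The abstract's approach can be illustrated with a minimal sketch: train a surrogate model on a small sample of measured configurations, then use its predictions to rank the remaining tuning space so only the most promising configurations need real benchmarking. Everything below is an assumption for illustration — the work-group parameter space, the synthetic cost model, and the 1-nearest-neighbour surrogate are not the paper's actual features, models, or benchmarks.

```python
# Illustrative sketch of ML-based auto-tuning (hypothetical parameter
# space and cost model; the paper's actual method differs).
import random

random.seed(0)

# Tuning space: candidate OpenCL work-group sizes (illustrative only).
CONFIGS = [(wx, wy) for wx in (4, 8, 16, 32, 64) for wy in (1, 2, 4, 8)]

def measure(cfg):
    """Stand-in for actually timing a kernel with this configuration."""
    wx, wy = cfg
    threads = wx * wy
    # Synthetic runtime: penalize work-groups far from 128 threads.
    return abs(threads - 128) / 128.0 + 0.01 * wx / wy

def predict(cfg, samples):
    """1-nearest-neighbour surrogate: runtime of the closest sampled config."""
    wx, wy = cfg
    nearest = min(samples,
                  key=lambda s: (s[0][0] - wx) ** 2 + (s[0][1] - wy) ** 2)
    return nearest[1]

# Train the surrogate on a small random sample of "real" measurements.
sample_cfgs = random.sample(CONFIGS, 6)
samples = [(c, measure(c)) for c in sample_cfgs]

# Rank all configs by predicted runtime; re-measure only the top few.
ranked = sorted(CONFIGS, key=lambda c: predict(c, samples))
best = min(ranked[:5], key=measure)
print("best work-group size:", best)
```

The design point this sketch captures is the one the abstract argues for: retuning on a new device costs only a handful of measurements plus model predictions, instead of an exhaustive sweep of the configuration space.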

Cited by 20 publications (31 citation statements). References 53 publications.
“…The higher error for the execution time is due to the huge variations in time for the same program for larger input values. These results obtained for occupancy and eligible warps are better than the mean error rate produced by Falch and Elster for Nvidia GPUs. For execution time, the mean error rate is 2% higher than the results obtained in the aforementioned work.…”
Section: Framework Validation and Evaluation
confidence: 59%
“…These results obtained for occupancy and eligible warps are better than the mean error rate produced by Falch and Elster for Nvidia GPUs. For execution time, the mean error rate is 2% higher than the results obtained in the aforementioned work. This is justifiable with the dataset used in this work, which is highly diverse, whereas that of Falch and Elster is not.…”
Section: Framework Validation and Evaluation
confidence: 59%
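The "mean error rate" these citation statements compare can be sketched as a mean relative error over predicted versus measured values. The formula and the numbers below are assumptions for illustration; the citing work may define its error metric differently.

```python
# Mean relative error: average of |predicted - actual| / actual.
# The sample values are hypothetical, not data from the cited works.
def mean_relative_error(predicted, actual):
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)

pred = [1.10, 0.95, 2.30]   # hypothetical predicted execution times (ms)
act  = [1.00, 1.00, 2.00]   # hypothetical measured execution times (ms)
print(round(mean_relative_error(pred, act), 3))  # → 0.1, i.e. a 10% mean error rate
```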